Author Archive

100 days of ML Code – day #1 notes

March 5, 2019

The Udacity deep learning nanodegree was a great start for understanding and practicing neural networks, but if you are not working actively in the domain, you get rusty… Therefore I decided to take the 100 days of ML code challenge. If you have never heard about the challenge, look at this YouTube video from Siraj Raval. In short it means “coding and/or studying machine learning for at least an hour everyday for the next 100 days. Pledge with the #100DaysOfMLCode hashtag on your social media platform of choice”. A well-known repository for this challenge is the one from Avik Jain, which can be found here.

So I did day #1 and below are my notes from this first day, which include setup and corrections of the code as well as some questions and answers.

Setting up your environment

This environment is enough to start our journey. It involves Jupyter notebooks and a few Python libraries (scikit-learn, pandas and numpy).

I use Anaconda to manage my Python environments, so if you don’t have it yet, get it and install it!

Once installed open an “anaconda prompt” and write the following command to define a new environment for the challenge.

conda create -n 100daysofmlcode python=3 jupyter notebook pandas scikit-learn numpy

This will create an environment named 100daysofmlcode and install the required dependencies. Once done, type “conda activate 100daysofmlcode” (plain “activate 100daysofmlcode” on conda versions older than 4.4) to activate the environment.

Next clone the repository in some folder, then from the same anaconda prompt navigate to the folder and start jupyter notebook (just type jupyter notebook from the anaconda prompt). You are now ready to start day 1.
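Before opening the notebook it can be worth checking that the environment really contains what the `conda create` command asked for. The sketch below is my own addition, not part of the challenge repository: the helper name `installed_versions` is hypothetical, and it relies only on the standard library (Python 3.8 or newer).

```python
from importlib import metadata

# Packages requested in the `conda create` command above.
REQUIRED = ["notebook", "pandas", "scikit-learn", "numpy"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it from the activated environment; any line reporting NOT INSTALLED points to a package to add with `conda install`.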

Corrections to the notebook

Since we installed the latest versions of numpy, pandas and scikit-learn, we need to update the original code from the repo to suit the API changes.

In “Step 3: Handling the missing data”, replace

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

with

import numpy as np  # np.nan below needs numpy imported as np
from sklearn.impute import SimpleImputer
# SimpleImputer has no axis argument: it always imputes column by column
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
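To see what the "mean" strategy does to a single column, here is a minimal pure-Python sketch of the same idea. It is not scikit-learn's implementation; `impute_mean` and the sample values are made up for illustration.

```python
import math


def impute_mean(column):
    """Replace NaN entries with the mean of the non-missing values."""
    known = [v for v in column if not math.isnan(v)]
    if not known:
        raise ValueError("column has no known values to average")
    mean = sum(known) / len(known)
    return [mean if math.isnan(v) else v for v in column]


# Hypothetical column with one missing entry.
ages = [44.0, 27.0, float("nan"), 38.0]
print(impute_mean(ages))
```

The missing entry is replaced by the average of the three known values; the known values themselves are left untouched.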

In “Step 5: Splitting the dataset into training sets and Test sets”, change

from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

to

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

Run the cells. You have now successfully completed day #1.

Label encoder vs One hot encoding

If you look carefully at the code, you’ll notice that we use a label encoder to map the result label (‘yes’/‘no’) to numbers, while we use one hot encoding to map the country. Why two different encoding schemes?
The reason is that you don’t want the algorithm to infer a relation between the numbers representing the countries, or at least you don’t want it to think there is an order between them (e.g. that France, represented by 1, is lower than Spain, represented by 2). Therefore we use a separate column for each country. If you want the full answer, look at this article: label encoder vs one hot encoder in Machine learning
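The difference is easy to see on a tiny example. The following is a plain-Python sketch of the two schemes, not the scikit-learn classes used in the notebook; `label_encode`, `one_hot_encode` and the country list are hypothetical names and data chosen for illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer, following the sorted order of the values."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Give each distinct value its own 0/1 column."""
    codes, classes = label_encode(values)
    rows = [[1 if code == col else 0 for col in range(len(classes))] for code in codes]
    return rows, classes


countries = ["France", "Spain", "Germany", "Spain"]

codes, classes = label_encode(countries)
print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [0, 2, 1, 2]

rows, _ = one_hot_encode(countries)
print(rows)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

With label encoding Spain (2) looks "greater" than Germany (1), an order that means nothing. With one hot encoding every country is just a 1 in its own column, so no order is implied.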

That’s all! Happy day 1/100 days of ML code!


The 5th annual Henry Taub TCE conference

June 2, 2015

Today I attended the first day of the 5th TCE conference. This year’s topic was “Scaling Systems for Big Data”. There were some nice lectures, especially the first one, which was the best of the day.

This lecture came from the Software Reliability Lab, a research group in the Department of Computer Science at ETH Zurich led by Prof. Martin Vechev, who presented it. The topic was “Machine Learning for Programming”: machine learning is applied to open source repositories (GitHub and the like) to build statistical models for things that were once “science fiction”, such as code completion (not a single word or method name but a whole block of code inside a method), de-obfuscation (given obfuscated code you get back nicely readable code with meaningful variable names and types) and others… This is a very interesting use of machine learning, and perhaps soon we (developers) may be obsolete 🙂
Some tools already use this technique: jsnice, which de-obfuscates JavaScript code, and the framework on top of which jsnice is built.

A few facts from a short Google talk on building scalable cloud storage:

  1. The corpus size is growing exponentially (nothing really new here)
  2. Systems (“cloud storage systems”) require a major redesign every 5 years. That’s the interesting fact… Remember that Google had GFS (the Google File System, which HDFS is an implementation of), then moved to Colossus (in 2012), so should we see a new file system in 2017? If so, they are certainly working on it already…
  3. Complexity is inherent in dynamically scalable architectures (nothing new here either)

If you are interested in mining and checking MS Excel files for errors and suspicious values (values that might be the result of human error), then CheckCell might be the solution for you. What about surveys? Can surveys have errors too? Well, it seems that the same questions presented in a different order produce different results (humans are sometimes really not logical), so if you have a survey and want to check whether you introduced some bias by mistake, then SurveyMan is the answer. You can refer to the blog of Emery Berger (who gave the talk) for more on CheckCell and SurveyMan.

Another nice talk came from Lorenzo Alvisi (UT Austin) about SALT, a combination of the ACID and BASE principles in a distributed database (in chemistry, ACID + BASE = SALT). It lets you scale a system and still use relational database concepts instead of moving to a pure BASE database, which increases the system complexity. The idea is to break relational transactions into new transaction types with better granularity and scalability. The full paper can be found here

By the way, if you are using map reduce, here is an interesting fact from another talk, by Bianca Schroeder from the University of Toronto (this work is just the start of a paper): long running jobs tend to fail more often than short ones, and retrying the execution more than twice is just a waste of cluster resources because the job will almost surely fail again. Using machine learning, the research team is able to predict, after 5 minutes of run time, the probability that the job will fail. The observations were made on Google clusters and on open clusters too. This is for sure a nice future paper…

Increasing disk size on Hadoop Cloudera’s VM

December 17, 2012

Cloudera offers a nice solution if you want to play with the Hadoop ecosystem (and other Cloudera add-ons): a virtualized single-node Hadoop cluster. The VM is available for VMware, KVM and VirtualBox and can be downloaded from the Cloudera download site

Lately I faced the problem that the predefined VM disk size (25GB) was not enough and I needed to increase it. Something that sounds trivial cost me several hours to figure out (especially when you are not a Linux admin, and when the graphical user interface of the virtualized guest OS is missing some system functionality).

So below are the instructions. I’ll show how to increase the disk size from 25GB to 100GB on the VMware image of CDH 4.x (using VMware Player).

  1. In VMware Player, with the Cloudera image shut down, go to the VM settings, Hardware tab, select the Hard Disk device, click the Utilities button and choose the Expand option. In the dialog set the new size (here 100 GB) and press OK.
    [Screenshot: expanding the VM disk in the VMware Player settings]
    Once completed (this operation takes several minutes, depending on the disk size), a popup will inform you that the virtual disk size was increased but that you need to modify your guest OS to use the new size. To do so we need to perform several admin operations in the guest OS, which in our case is the CentOS Linux distribution.
  2. The next step is to modify the boot options to start the guest OS without any services and without the graphical UI; in fact we will start the guest OS in run level 1 (single user mode). So start the VM, and at the boot screen press any key to enter the GRUB boot manager (you have 3 seconds to do so). When the GRUB menu shows, go to the entry for the Cloudera demo VM, press ‘e’ to edit the entry, go to the kernel line and press ‘e’ again, then add ‘1’ at the end of the line, press Enter, then ‘b’ to boot with the modified options.
    [Screenshots: the GRUB menu entry and the edited kernel line]
  3. Once booted, log in as root (password is cloudera) and check the disk size using ‘df -h’: you will see that it is still 25GB. ‘fdisk -l’ shows the physical disk and its allocated partition: the physical disk (/dev/sda) already reflects the increased size, but the partition (/dev/sda1) does not.
    We will change this using the fdisk and resize2fs commands.
  4. So at the prompt type ‘fdisk /dev/sda’. We will delete and recreate the partition. Pay attention: the newly created partition must start at the same position as the one we delete. To note its starting point, press ‘p’ at the fdisk prompt. In my case /dev/sda1 started at 1.
    • at the fdisk prompt press ‘d’ to delete the partition; since there is only one partition it is selected automatically.
    • press ‘n’ to create a new partition, then ‘p’ for a primary partition, then ‘1’ for the partition number; enter your previous starting number or press Enter to pick the default, then enter the end or press Enter to pick the default, which is the maximum size (here 100GB)
    • at this stage a new partition /dev/sda1 should have been created.
    • type ‘w’ to write the partition table change to disk; this also exits fdisk (use ‘q’ only if you want to quit without saving).
  5. Reboot (you can use the ‘reboot’ command for that), at the start screen proceed as in step #2 to boot in run level 1, and log in as root again. ‘df -h’ will still show a disk size of 25GB, but ‘fdisk -l’ will show that the new partition has a size of 100GB.

    To resolve this, at the command prompt type ‘resize2fs /dev/sda1’. Once the resize ends, the increased disk space should be reflected by ‘df -h’ (about 99GB in my case).
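If you prefer to check the result from a script rather than by reading the ‘df -h’ output, the following sketch reports the same numbers. It is my own addition and assumes a Python 3.3+ interpreter is available in the guest, which may not be the case on an older image; `disk_report` and `format_gib` are hypothetical helper names.

```python
import shutil


def format_gib(num_bytes):
    """Render a byte count in GiB with one decimal place."""
    return f"{num_bytes / 1024 ** 3:.1f}G"


def disk_report(path="/"):
    """Return total, used and free space for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total": format_gib(usage.total),
        "used": format_gib(usage.used),
        "free": format_gib(usage.free),
    }


if __name__ == "__main__":
    print(disk_report("/"))
```

After a successful resize, the "total" entry for the root filesystem should be close to the new partition size instead of the old 25GB.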

That’s it. You can now reboot as usual and enjoy your increased disk space.


Pig and Hbase integration

September 15, 2011

The Hadoop ecosystem contains a lot of sub-projects. HBase and Pig are just two of them.

HBase is the Hadoop database, which lets you manage your data as tables rather than as files.

Pig is a scripting language that generates map reduce jobs on the fly to get the data you need. It is very compact compared to hand-written map reduce jobs.

One of the nice things about Pig and HBase is that they can be integrated, thanks to a recently committed patch.

The documentation is not up to date yet (what exists mostly relates to the patch itself). Some can be found in posts like the one here, but they all lack detailed explanations. Even the Cloudera distribution CDH3 indicates support for this integration, but no sample can be found.

Below I describe the installation and configuration steps to make the integration work, provide an example and finally point out some limits of the current release (0.8).

  1. First, install the map reduce components (Job Tracker and Task Tracker): one Job Tracker and as many Task Trackers as you have data nodes. Each distribution may provide a different installation procedure; I’m using the Cloudera CDH3 distribution, whose map reduce installation is well documented.
  2. Now proceed with the Pig installation, which is also easy as long as you are not trying the integration with HBase. You only need to install Pig on the client side: not on each Data Node nor on the Name Node, just on the machine where you want to run the Pig program.
  3. Check your installation by entering the grunt shell (just enter ‘pig’ from the shell).
  4. Now the tricky part. To use the Pig/HBase integration you need to make the map reduce jobs aware of the HBase classes, otherwise you will get a “ClassNotFoundException” or, worse, a ZooKeeper exception like “org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase” during execution. The easy way to do this without copying the HBase configuration into your Hadoop configuration directory is to set HADOOP_CLASSPATH and let hbase print its own classpath.
    So add the following to your environment script:

    #define the location of hbase
    export HBASE_HOME=/usr/lib/hbase
    #Customize the classpath to make map reduce job aware of Hbase
    export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"
  5. You will also need Pig to be aware of the HBase configuration. For this you can use the HBASE_CONF_DIR environment variable (for the CDH release), which by default is /etc/hbase/conf.

OK, your installation should be fine now, so let’s do an example… Let’s assume we have stored in HBase a table named TestTable with a column family named A and several fields named field0, field1, …, and we want to extract this information and store it into ‘results/extract’. In this case the Pig script looks like:

my_data = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey=true') as (id, field0, field1);

store my_data into 'results/extract' using PigStorage(';');

The above script indicates that the my_data relation will contain the fields field0 and field1 plus the ID (due to the -loadKey parameter). These fields will be stored as id, field0, field1 under the ‘results/extract’ folder, with values separated by semicolons.
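Once the job has run, the extracted files are plain semicolon-separated text, so they are easy to consume from any language. Here is a small Python sketch of reading them back; `read_extract` is a hypothetical helper, and the two sample rows are made up to stand in for a part file under ‘results/extract’.

```python
import csv
import io


def read_extract(lines):
    """Parse PigStorage(';') output lines into dicts keyed by id, field0, field1."""
    reader = csv.reader(lines, delimiter=";", quoting=csv.QUOTE_NONE)
    return [dict(zip(("id", "field0", "field1"), row)) for row in reader]


# Made-up sample standing in for one output file of the script above.
sample = io.StringIO("row1;a;b\nrow2;c;d\n")
for record in read_extract(sample):
    print(record)
```

In practice you would pass an open file handle instead of the in-memory sample. The sketch assumes the values themselves never contain a semicolon.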

You can also use comparison operators on the key. The operators currently supported are lt, lte, gt and gte, for lower than, lower than or equal, greater than, and greater than or equal.

 Note: There is no support for logical operators; you can use more than one comparison operator, and they are chained as AND.
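To make the AND chaining concrete, here is a small Python model of the semantics described above. It is only an illustration, not how HBaseStorage is implemented: `key_matches` and the sample keys are made up, and keys are compared as raw bytes because HBase orders row keys lexicographically by byte.

```python
import operator

# The four comparison operators named above.
OPERATORS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def key_matches(key, conditions):
    """True only if the key satisfies every (operator, bound) pair, i.e. AND."""
    return all(OPERATORS[name](key, bound) for name, bound in conditions)


conditions = [("gte", b"row2"), ("lt", b"row7")]
print(key_matches(b"row5", conditions))   # True: inside the range
print(key_matches(b"row9", conditions))   # False: above the upper bound
print(key_matches(b"row10", conditions))  # False: byte order puts "row10" before "row2"
```

The last line shows a common surprise with key ranges: numeric suffixes are not compared as numbers, so keys usually need zero padding if you want range scans to follow numeric order.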


The current HBaseStorage does not allow wildcards: if you need all the fields in a row, you need to enumerate them. Wildcards are supported in version 0.9.

You can also use HBaseStorage to store records back into HBase; nevertheless its usage is inconsistent, and a bug was already opened on this.

The 1st Technion Computer Engineering (TCE) Conference – Day #2 (June 2, 2011)

June 5, 2011

This year the Technion (Israel Institute of Technology) held “The 1st Technion Computer Engineering (TCE) Conference”, and I registered for the second day (Thursday, June 2). The day’s topic was “Computer Architecture & Systems” and a lot of lecturers from both academia and industry were present. Among them I had the pleasure to listen to Leslie Lamport (from Microsoft), well known for his “Paxos algorithm” paper (see resources), and other really nice lecturers like Yale N. Patt (University of Texas at Austin).

Below are some of my impressions of the day

  • Moore’s Law seems to be reaching its limit, or at least it is not certain that doubling the number of transistors every 18 months is still the primary goal of the CMOS industry (we barely need the many cores we already have)
  • Parallel programming is one of the major topics that should be learned, and we should stop building the “it’s complicated” barrier around it
  • Security of data in the cloud seems to get major attention from both Microsoft and Intel
  • Theory (mathematics) and hardware knowledge are important for writing high-performance programs.


Categories: Programming

Factors affecting C++ Compilation time – How to reduce them

August 11, 2010

Well, I never imagined that I would write a C++ article when my main specialization is Java. Anyway, for the last three years I have been involved in a cross-discipline project involving Java and C++.

In this project a Java generator generates millions of lines of C++ code, which of course have to be compiled, and if you are a C++ guy your hair is certainly already standing on end at the thought of the time required to compile such a huge amount of code. Well, you are right: we faced extremely long compilation times (12+ hours in Unix, in windows…), which are a major problem in a product that should have a quick time to market.

Worse, the product is used on both Windows and Unix platforms, which means that a solution needed to be found for both worlds.

Under Windows – even with Incredibuild from Xoreax (a great grid compiler platform which reduced the compilation time considerably), the user still needed to wait 2 hours for compilation, which was not acceptable.

Under Unix – there is no grid compiler (unless you work on only a few platforms). We tried distcc, but the results were still not satisfying and you need additional hardware. We were stuck…

Therefore we began looking for an alternative that could speed up the builds, and for that we needed to understand the factors that affect compilation time. Our main suspect was the number of lines of code to compile: since the code was generated, it was very easy to inflate the output. Nevertheless we soon understood that we were wrong…

Below are the factors impacting compilation time (ordered by their impact)

  1. In the first place, the number of files to compile. This is one of the major factors affecting compilation time: the compiler is not really smart at reusing information between invocations, nor is it able to work on a set of files, and it is especially slow (I/O bound) when building dependencies. If you really want to reduce the compilation time, reduce the number of files to compile. That does not mean writing everything in a single file; you can use what is called a Unity Build.
    A Unity Build groups several cpp files into a single one using just the include directive. For example, say you want to compile file1.cpp to file10.cpp; then create a new file group.cpp as follows:

    #include "file1.cpp"
    #include "file2.cpp"
    #include "file3.cpp"
    // ... file4.cpp to file9.cpp in the same way ...
    #include "file10.cpp"

    Now compile group.cpp, and don’t forget to add the location of file1.cpp … file10.cpp to the include path.
    This method works miracles (of course you have to balance the number of files you put in a single group/unity). Our compilation time dropped from 15 hours to just 2.

  2. Include paths – a large number of include paths directly affects the build time, since the compiler (or preprocessor) needs to scan the paths until it finds the requested include. So try to minimize them, or at least order the path list so that the most searched ones come first.
  3. NAS (Network Attached Storage) also has a bad impact on compilation (write is usually fast, but read is slow, so library creation is slow).
  4. Generate cppdep files and compile at the same time – Unix compilers support an option to create the cppdep (dependency) file during compilation; this can save approximately 20% of your compilation time.
  5. Forward declarations, and the “Pimpl idiom” that builds on them, reduce dependencies and help greatly. The problem is that you cannot always refactor the code to avoid the includes that would cancel out your effort.
  6. Usage of templates – using C++ templates excessively increases compilation time and library size (especially if the template is defined in a header).
  7. Number of string constants in a single file. It might sound strange, but some compilers (HP and Sun at least) suffer a performance degradation when the compilation unit contains a large number of strings (a few thousand).
    Note: the Visual Studio compiler is not sensitive to this factor.
  8. Generic vs inflated code – calling a function versus writing out the content of the function wherever you need it (like a forced “inline”). Inlining functions this way may produce better runtime performance, and it does not affect compilation time as much as you might think. We reduced millions of lines of code by 75% by calling functions instead of inlining their content and got no improvement in build time, but at least we gained more maintainable/debuggable code.
  9. Usage of pre-compiled headers might help, but in our tests it did not; the compilation time in fact increased.
  10. Usage of a header cache folder – similarly to pre-compiled headers, this should help (according to vendors), but in our tests most of the time it did not.

So if you really want to reduce your compilation time, try the Unity Build concept. You will gain:

  1. Faster build time
  2. Smaller object size
  3. Smaller library size
  4. Better optimized code

Note: The compilation time is related to the number of cpp files in a single Unity Build (and their dependencies), and this number should be tuned according to the content of the included files (inlines, template usage, headers used…). If you put too many files in a single unity/group file, the compilation time goes back up (though it stays better than without group files); the library size, however, keeps declining even then.
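Since the group size has to be tuned, it helps to generate the group files with a script rather than by hand. The following Python sketch shows one way to do it. It is not the generator used in our project: the function names, the group0.cpp, group1.cpp… naming and the directory layout are assumptions made for the example.

```python
from pathlib import Path


def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def unity_source(cpp_files):
    """Return the text of one group file that includes each .cpp in turn."""
    return "".join(f'#include "{name}"\n' for name in cpp_files)


def write_unity_files(source_dir, out_dir, group_size):
    """Write group0.cpp, group1.cpp, ... covering every .cpp file in source_dir.

    out_dir should differ from source_dir, otherwise a second run would
    pick up the generated group files as sources.
    """
    sources = sorted(p.name for p in Path(source_dir).glob("*.cpp"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, group in enumerate(chunk(sources, group_size)):
        target = out / f"group{index}.cpp"
        target.write_text(unity_source(group))
        written.append(target)
    return written
```

Calling write_unity_files("src", "unity", 10) would produce one group file per ten sources; compile the files in the output directory with the source directory on the include path, then rerun with a different group_size to find the best balance. Keep in mind that sources sharing a group also share file-local names, so static functions or anonymous-namespace symbols with the same name in two files will clash.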


Setting up classpath from jar file

May 27, 2010

Jar files (JAva aRchives) are very convenient containers: you can pack all you need for your application (at least classes and resources), put the jar on the target environment and just run java -cp <myapp.jar> <appMain> <command line args> to execute your program.

With a jar file you don’t need scripts or a long command line to set up your classpath for execution. You can do even better than configuring the classpath and the main class from the command line: you can use the manifest file for this. Doing so, you can just type java -jar <myapp.jar> <command line args>

The manifest is a text file (property-like) containing information on the archive. As part of this information you can define the main class of the archive and define the classpath (as long as you did not pack other jars inside it too).

To do so, define the ‘Class-Path’ and ‘Main-Class’ attributes in the manifest. Following is a sample:

Main-Class: sample.pkg.MyMain
Class-Path: directory-one/sub-directory-one/referenced.jar directory-two/

Keep in mind that:

  1. You can specify several directories and/or referenced jars using a space as delimiter
  2. References to directories and other jars are relative to the location of the jar
  3. A jar referenced through the Class-Path attribute cannot be packed inside your original archive (without a special classloader)
  4. If you have resources in some directory, don’t forget the slash at the end, otherwise the content of the directory is not seen.
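If the manifest is produced by a build script, the rules above can be checked at generation time. Below is a minimal Python sketch of that idea. `build_manifest` is a hypothetical helper, not part of any Java tool, and the class and path names in the example are made up. It refuses directory entries without a trailing slash and lines longer than the 72 bytes the JAR specification allows, rather than folding them.

```python
def build_manifest(main_class, class_path):
    """Return MANIFEST.MF text with Main-Class and a space-separated Class-Path."""
    for entry in class_path:
        # Rule 4 above: a directory entry must end with a slash.
        if not (entry.endswith(".jar") or entry.endswith("/")):
            raise ValueError(f"directory entries need a trailing slash: {entry!r}")
    lines = [
        "Manifest-Version: 1.0",
        f"Main-Class: {main_class}",
        "Class-Path: " + " ".join(class_path),
    ]
    for line in lines:
        # The JAR specification limits a manifest line to 72 bytes; longer
        # values must be folded onto continuation lines, which this sketch
        # does not attempt.
        if len(line.encode("utf-8")) > 72:
            raise ValueError("line exceeds 72 bytes and would need folding")
    # The file must end with a newline so that the last attribute is read.
    return "\n".join(lines) + "\n"


print(build_manifest("sample.app.MyMain", ["lib/referenced.jar", "conf/"]))
```

The returned text can be written to a file and handed to the jar tool when the archive is created.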


JAR specification

Categories: JAR, Java