
100 days of ML Code – day #1 notes

March 5, 2019

The Udacity deep learning nanodegree was a great start for understanding and practicing neural networks, but if you are not actively working in the domain, you get rusty. So I decided to take the 100 Days of ML Code challenge. If you have never heard of it, look at this YouTube video from Siraj Raval. In short, it means “coding and/or studying machine learning for at least an hour everyday for the next 100 days. Pledge with the #100DaysOfMLCode hashtag on your social media platform of choice”. A well-known repository for this challenge is the one from Avik Jain, which can be found here.

So I did day #1 and below are my notes from this first day, which include setup and corrections of the code as well as some questions and answers.

Setting up your environment

This environment is enough to start our journey. It consists of Jupyter Notebook and a few Python libraries (scikit-learn, pandas and NumPy).

I use Anaconda to manage my Python environments, so if you don’t have it yet, get it and install it!

Once it is installed, open an “Anaconda Prompt” and run the following command to create a new environment for the challenge.

conda create -n 100daysofmlcode python=3 jupyter notebook pandas scikit-learn numpy

This will create an environment named 100daysofmlcode and install the required dependencies. Once done, type “conda activate 100daysofmlcode” (or plain “activate 100daysofmlcode” on older Anaconda installs on Windows) to activate the environment.

Next, clone the repository into a folder, navigate to that folder from the same Anaconda Prompt, and start Jupyter Notebook (just type jupyter notebook). You are now ready to start day 1.
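Before opening the day 1 notebook, it can help to confirm that the environment really contains the libraries. Below is a minimal sketch for that, to be run in a notebook cell or a Python prompt inside the activated environment. The helper name installed_versions is my own (it is not part of the repo), and it relies on importlib.metadata, so it assumes Python 3.8 or newer.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing from the active environment.
            versions[name] = None
    return versions


# The distribution names below match the ones passed to "conda create".
required = ["numpy", "pandas", "scikit-learn", "notebook"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If one of the lines prints NOT INSTALLED, the prompt is most likely not running inside the 100daysofmlcode environment.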

Corrections to the notebook

Since we installed the latest versions of NumPy, pandas and scikit-learn, we need to update a few calls in the repo’s code to match API changes.

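In “Step 3: Handling the missing data”, change

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

to

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")

Here np is NumPy, which the notebook imports as np. SimpleImputer has no axis argument; it always imputes column by column, which is what axis = 0 did.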
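To see what the new imputer does, here is a small self-contained sketch. The toy array is made up for illustration (it is not the dataset from the repo); each missing value is replaced by the mean of the known values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up toy data: columns are age and salary, with one missing value each.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Replace every NaN with the mean of the non-missing values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The missing age becomes the mean of 44, 27 and 38, and the missing salary becomes the mean of the three known salaries.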

In “Step 5: Splitting the dataset into training sets and Test sets”, change

from sklearn.cross_validation import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

to

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
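Only the import path changes; the call itself stays the same. As a quick illustration, here is a sketch on made-up data (10 samples, not the repo’s dataset) showing that test_size = 0.2 keeps 20% of the rows for testing and that features and labels stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# test_size=0.2 holds out 20% of the samples; random_state makes the
# shuffle reproducible from run to run.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
```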

Run the cells and you have successfully completed day #1.

Label encoder vs One hot encoding

If you look carefully at the code, you’ll notice that we use a label encoder to map the result label (‘yes’/‘no’) to numbers, while we use one-hot encoding for the country. Why use a different encoding scheme?
The reason is that you don’t want the algorithm to infer a relation between the numbers representing the countries, or at least you don’t want it to think there is an order among them (e.g. that France, represented by 1, is lower than Spain, represented by 2). Therefore we use a separate column for each country. For the binary result label this is not an issue, since two values map naturally to 0 and 1. If you want the full answer, look at this article: label encoder vs one hot encoder in Machine learning
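The difference between the two encodings is easier to see on a tiny example. The sketch below uses made-up values (not the repo’s dataset) and the current scikit-learn encoders, so it is an illustration of the idea rather than the notebook’s exact code.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sample values for illustration.
countries = np.array(["France", "Spain", "Germany", "Spain"])
purchased = np.array(["No", "Yes", "No", "Yes"])

# Label encoding: one integer per class (classes are sorted alphabetically).
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(purchased)

# One-hot encoding: one 0/1 column per country, so no order is implied.
# The encoder expects a 2D input, hence the reshape to a single column.
onehot_encoder = OneHotEncoder()
country_columns = onehot_encoder.fit_transform(countries.reshape(-1, 1)).toarray()

print(y)
print(onehot_encoder.categories_[0])
print(country_columns)
```

Each row of country_columns contains exactly one 1, in the column of its country, so the model cannot read “Spain > France” into the numbers.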

That’s all! Happy day 1/100 days of ML code!


The 5th annual Henry Taub TCE conference

June 2, 2015

Today I attended the first day of the 5th TCE conference; this year’s topic was “Scaling Systems for Big Data”. There were some nice lectures, and the first one was the best of the day.

This lecture came from the Software Reliability Lab, a research group in the Department of Computer Science at ETH Zurich led by Prof. Martin Vechev, who presented it. The topic was “Machine Learning for Programming”: machine learning is applied to open source repositories (GitHub and the like) to build statistical models for things that were once “science fiction”, such as code completion (not a single word or method name, but a whole chunk of code inside a method), de-obfuscation (given obfuscated code, you get nicely readable code with meaningful variable names and types) and others. This is a very interesting use of machine learning, and perhaps soon we (developers) may be obsolete 🙂
Some tools using this technique: JSNice, which shows de-obfuscation of JavaScript code, and the framework on top of which JSNice is built.

A few facts from a short Google talk on building scalable cloud storage:

  1. The corpus size is growing exponentially (nothing really new here)
  2. Systems (“cloud storage systems”) require a major redesign every 5 years. That’s the interesting fact. Remember that Google had GFS (the Google File System, which HDFS is modeled on), then moved to Colossus (in 2012), so by that logic should we see a new file system in 2017? If so, they are certainly working on it already.
  3. Complexity is inherent in dynamically scalable architectures (well, nothing new here either)

If you are interested in mining and checking MS Excel files for errors and suspicious values (indicating that some values might be human errors), then CheckCell might be the solution for you. What about surveys? Can surveys have errors too? Well, it seems that the same questions presented in a different order will produce different results (humans are sometimes really not logical), so if you have a survey and want to check whether you introduced some bias by mistake, then SurveyMan is the answer. You can refer to the blog of Emery Berger (who gave the talk) for CheckCell and SurveyMan.

Another nice talk, from Lorenzo Alvisi (UT Austin), was about SALT, a combination of the ACID and BASE principles (in chemistry, ACID + BASE = SALT) in a distributed database. It lets you scale a system and still use relational database concepts instead of moving to a pure BASE database, which increases system complexity. The idea is to break relational transactions into new transaction types with better granularity and scalability. The full paper can be found here

By the way, if you are using MapReduce, here is an interesting fact from another talk, by Bianca Schroeder from the University of Toronto (this is a starting paper): long-running jobs tend to fail more often than short ones, and retrying the execution more than twice is just a waste of cluster resources because it will almost surely fail again. Using machine learning, the research team is able to predict, after 5 minutes of run time, the probability that a job will fail. The observations were made on a Google cluster and an open cluster too. This is for sure a nice future paper…