
Data Science

Python Pandas vs SAS: Head to head data analysis (Part 1)

Like every other data scientist out there, one of the questions I asked myself recently is “what programming language or data analytics tools should I learn to become a good data scientist?” The answer to this probing question is as varied as the potential tools themselves. Interestingly, the same languages often emerge in the top 3, regardless of which platform (LinkedIn, Indeed, StackOverflow, Reddit, etc.) you get your data from. Recently, RJMetrics published a comprehensive article which I found […]

Understanding Clustering for Machine Learning

Since we’re talking about Clustering for Machine Learning, let’s start by understanding what we mean by a cluster. The clustering process is based on some pre-defined criteria. Clustering is an unsupervised learning technique in which there are no predefined classes and no examples or prior information demonstrating how the data should be grouped (or labeled) into separate classes. Clustering could also be considered an Exploratory Data Analysis (EDA) process which allows us to discover hidden patterns of interest […]

Non-technical intro to Random Forest and Gradient Boosting in machine learning

“The collective wisdom of many is likely more accurate than any one.” (Wisdom of the crowd, Aristotle, 300 BC) The concept of an ensemble is fundamental to many areas of our lives. A choir of singers is an ensemble. A band of instrumentalists is an ensemble. A group of vocalists singing different parts (bass, alto, tenor, soprano) is an ensemble. A group of kids singing melodious a cappella is an ensemble. You can already see the trend, right? In […]

Python vs SAS: Employee demographics analysis & plots (Part 3)

This is the third part of my exploratory Python Pandas vs SAS data analysis, where I present both Python and SAS code performing the same functions. I provided the justifications for this work in Part I and performed fundamental summary statistics in Part II using the Group-Apply-Combine feature of Pandas. In this Part III of the series, we shall be performing an employee demographics analysis within the Sales department of Orion Sports Star. More importantly, we shall be using another powerful […]

Python vs SAS: Computing summary statistics (Part 2)

I recently started a series of blog posts to share my work experiences using SAS and Python Pandas for Data Analysis. If you’re coming directly to this post, you can see my first post on Python Pandas vs SAS: head to head data analysis here. In this part two of the series, I will be using the very powerful Group-Apply-Combine feature in Python Pandas to compute summary statistics, and showing the equivalent in SAS as well. Then I’ll […]