
Python Pandas vs SAS: Head to head data analysis (Part 1)

Like every other data scientist out there, one of the questions I asked myself recently is “what programming language or data analytics tools should I learn to become a good data scientist?”

The answers to this probing question are as varied as the tools themselves. Interestingly, the same languages tend to emerge in the top 3 no matter which platform (LinkedIn, Indeed, StackOverflow, Reddit, etc.) the data comes from.

Recently, RJMetrics published a comprehensive article which I found really exciting and informative. In the top 3 are (drumroll…): R, Python and SQL! Another language on the list that caught my attention is SAS, which is the #1 license-based data analysis tool.

Data analysis is permeating every industry, and every data scientist must be ready to learn new tools and adapt to whatever tools are already in use in the organization they join. With that in mind, I have decided to learn and consistently use SAS along with Python, since they are two of the most advanced analytics tools in the industry.

If you’re a student or faculty interested in learning or using SAS, you can download the full SAS University edition software or use SAS Studio in the cloud for free.

Python Pandas vs SAS – IMO

While I’ve tried to learn R, I feel more comfortable with Python and SAS. SAS is intuitive and gives me hundreds of datasets pre-bundled in the library. It’s also fantastic for generating quick visualizations and does a lot of processing under the hood.

However, Python Pandas is different. It’s powerful and flexible, gives you the power of a superman and makes you feel like a programmer.

Unlike with SAS, you must understand how things work before you can use Python Pandas effectively.

Getting Started

Okay, this article is getting fr3aking longer than I anticipated, so I’m going to divide it into different sections and multiple pages for easy reading.

Study Outline

The Dataset

We’ll export a sample dataset from the SAS dataset library as a .csv file and then import it into Python Pandas. The dataset is the fictitious Orion Star Sports dataset from the SAS library. Orion operates a traditional store, an online store and a large catalog business.

The company HQ is located in the United States, with many other locations throughout the world. With 1,000 employees and 90,000 customers, the company processes over 150,000 orders annually and has over 64 suppliers. So as expected, Orion Star has a huge dataset about customers, suppliers, products and employees.

We’ll work with this data to carry out various business analytics tasks in this course. We’ll analyze the dataset using both SAS and Python. For each step in the analysis, we’ll show the Python and SAS code along with some explanation and discussion of the different approaches.

Data Pre-processing

Downloading a dataset from the SAS library is a somewhat surgical process. That shouldn’t be a problem for expert SAS users, but those starting out need intuition and grit to figure it out.

Here’s to successful data grab from SAS… Cheers!

I always use the Brackets.io editor (or good ol’ MS-Excel!) to do a quick preview of my dataset before I attempt to do anything with the file. It’s the best text editor that I know. It easily reads my .dat, .csv, and even binary image files!

 

Import the dataset

Use Pandas to import the dataset, then check the datatype of each field for integrity. We should get int64 for integer variables and object for text variables. Note that I have uploaded the CSV dataset file to my account on GitHub, so you can copy-paste the code directly into IPython to grab the file and run it.

Code:
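
As a minimal sketch of the import step described above, the snippet below reads a CSV into a DataFrame and prints the column types. The original file location isn't reproduced here, so the inline `SAMPLE_CSV` text, its column names and its rows are illustrative assumptions; swap in the path or URL of your own exported file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV exported from SAS.
# Column names and rows are illustrative, not the real Orion data.
SAMPLE_CSV = """Employee_ID,First_Name,Last_Name,Gender,Salary,Country
1001,Ann,Lee,F,26600,AU
1002,Bob,Kim,M,30890,US
1003,Cy,Roy,M,108255,US
"""

# pd.read_csv accepts a file path, a URL, or a file-like object.
sales = pd.read_csv(io.StringIO(SAMPLE_CSV))

# One dtype per column: whole-number columns come back as int64,
# text columns as object (or a dedicated string dtype in newer pandas).
print(sales.dtypes)
```

Passing a file-like object keeps the sketch runnable offline; with a real file you would pass its path or URL to `pd.read_csv` instead.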

Output:

Now we know the data structures and have confirmed that the SAS dataset was imported successfully into a Pandas DataFrame. Let’s view the first 5 records of the sales table in SAS and of the DataFrame in Pandas, and confirm that they’re the same.

Python Code:
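
A minimal sketch of viewing the first 5 records with `DataFrame.head()`. The small `sales` frame built here is a hypothetical stand-in for the imported table, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the imported sales table (values are made up).
sales = pd.DataFrame({
    "Employee_ID": [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    "Salary": [26600, 30890, 108255, 27110, 26960, 31000, 25500],
})

# head() returns the first 5 rows by default; pass n for a different count.
first_five = sales.head()
print(first_five)
```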

SAS Code:

In the next article, I will give more explanation on this code snippet. But for now, simply look at the output generated below:

Output SAS:

orion.sales.head

 

Output Python:

python_head

Another simple preprocessing step is to inspect the last 5 records in the sales table. This step also allows us to confirm that Python and SAS both have the same number of records in the imported dataset. In this case, we should have 165 records.

 

Code:
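
A minimal sketch of the Python side of this check, using `DataFrame.tail()` and `len()`. The `sales` frame is a hypothetical stand-in that only matches the expected row count of 165; the `Employee_ID` values are made up.

```python
import pandas as pd

# Hypothetical stand-in with the same row count as the sales table (165).
sales = pd.DataFrame({"Employee_ID": list(range(1001, 1001 + 165))})

# tail() returns the last 5 rows by default.
last_five = sales.tail()
print(last_five)

# len() on a DataFrame gives the number of rows.
print(len(sales))
```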

Output SAS:

orion.sales.tail

Output Python:

python_tail

Notice that the records are the same. Because the Pandas DataFrame index is zero-based, its rows are numbered 0-164, while SAS numbers observations starting from 1.
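
If you want the Pandas row labels to line up with SAS observation numbers, one option is to shift the index by one. This is a minimal sketch on a hypothetical 165-row frame; the name `sales_sas_style` is introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical 165-row stand-in for the sales table.
sales = pd.DataFrame({"Employee_ID": list(range(1001, 1001 + 165))})

# Default pandas index runs from 0 to 164.
print(sales.index.min(), sales.index.max())

# A copy relabelled 1..165 to line up with SAS observation numbers.
sales_sas_style = sales.copy()
sales_sas_style.index = sales_sas_style.index + 1
print(sales_sas_style.index.min(), sales_sas_style.index.max())
```

Only the labels change; the data and the row count stay the same.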

 

Summary

So there you have it: a simple comparison between SAS and Python Pandas. In the next article, I will show how to compute descriptive (summary) statistics using group-apply-combine in Python and a similar technique in SAS.

 

By @RichardAfolabi

I'm a thinker, teacher, writer, Python enthusiast, Wireless Engineer, Web geek and a solid Chelsea FC Fan. I'm interested in data science, analytics, visualization and data intelligence. Feel free to get in touch.

  • Melodie Rush

    Thanks for the great comparison. I look forward to your next article.
    Hope this helps you as you are learning SAS. This code is more efficient for printing the first 5 rows using SAS 🙂
    proc print data=orion.sales(obs=5);
    run;