Menu

Python vs SAS: Computing summary statistics (Part 2)

I recently started a series of blog posts to share my work experiences using SAS and Python Pandas for Data Analysis. If you’re coming directly to this post, you can see my first post on Python Pandas vs SAS: head to head data analysis here »

In this part two of the series, I will be using the very powerful Group-Apply-Combine feature in Python Pandas for computing summary statistics and showing the equivalence in SAS as well. Then I’ll let you make your decision on which one is simpler, easier or more flexible.

The Dataset

The dataset used in this article is provided in the introductory part of this series. That’s the more reason you should at least scan through the first part of this series.

SAS Library

SAS has a dataset library that can be viewed as permanent and temporary. The permanent contain dataset that are persistent in the library. The dataset will be there when you quit and restart you SAS program. The temporary library on the other hand is like a worksheet that store transient dataset about the current workspace. We call the temporary library the WORK library while the permanent can have any other name: Sales, Customer, Supplier, etc

The snippet of code below create a new table newsalesemps (new sales employees) into the temporary workspace WORK.

The newsalesemps is one of the tables from the Sales dataset in the Orion Star Sport database. The Orion database is in the permanent library.

Computing Summary Statistics in SAS

In lines 7-10 in the code above, we filtered the Sales table in the Orion database for those sales employees whose salary is greater than $20,000. We the stored the newly filtered table as newsalesemps and saved it in the temporary workspace WORK where we can use it for future analysis.

Lines 13-15 says print out the filtered table and give it this specific title. The code snippet here have been explained in Part I of this series.

Lines 18-22 is a proc statement which says: go into the temporary workspace again, pull out the newsalesemps table, classify the data based on the Job Title and find the means / summary statistics using the variable Salary. Run the proc and print it out with a descriptive title already set in the filtered newsalesemps table.

READ NEXT:  What Most Schools Don't Teach - Feat. Mark Zuckerberg, Bill Gates

SAS Output

sas_summary_statistics

Notice that the minimum Salary is greater than 20,000. Observe also that all the data in the newsalesemps have been classed (or grouped) according to the job title and descriptive statistics about the Salary is provided.

Also note that we did not have to do anything to get a nice little visualization of the table. Talk about magic wand!

Let’s switched gear a bit to discuss how to get the same summary statistics using the nifty Python Pandas.

Computing Summary Statistics in Python

We apply the powerful Group-Apply-combine feature on the Pandas dataframe. First, we filter those employees whose salaries are greater than 20,000 into a new Pandas dataframe:

Python Code:

Python Output:

salary_python

Next, we extract the Salary and Job Title columns from the dataset and apply groupby function. Thereafter, we do an aggregation to get the statistics by using Numpy package.

Python Code:

Output:

sales_stats_python

Do you notice anything in the final tables of both SAS and Python? Besides the sweet visualization of the SAS table,  the results are essentially the same for both SAS and Python with same mean, standard deviation, minimum and maximum salaries.

READ NEXT:  Python Pandas vs SAS: Head to head data analysis (Part 1)

Updated:

I came across a post on healthyalgorithms.com describing a function called unstack. Here’s how I’ve combined it with groupby to get something much more interesting.

Python Code:

Output:

pandas_describe_statistics

Obviously the describe function provides a more detailed statistics than the previous Numpy approach. Note that before applying the unstack() function, the result was aligned vertically – stacked on top of each other – making it a really long output. The unstack() function aligns the results horizontally, just the way we want it.

Summary

This article explores basic usage of the Group-Apply-Combine in Python Pandas and present procedure  on computing summary statistics for both SAS and Python. In the next article, we shall explore the employee demographics and present some nice little visualization to spice it all up.

In the meantime,  let me know which of these codes you find interesting, simpler, clearer and more self-explanatory. Which one do you use? Which one do you prefer? Share your experience with either of these tools in the comment box below.

Cheers!

 

 

By @RichardAfolabi

I'm a thinker, teacher, writer, Python enthusiast, Wireless Engineer, Web geek and a solid Chelsea FC Fan. I'm interested in data science, analytics, visualization and data intelligence. Feel free to get in touch.

  • sas user

    I’m finding these posts and code to be great thanks so much

  • Rob Kellington

    Good Job. It seems the Python solution is equivalent – actually simpler to read – than the SAS toolset.

  • Charles Sutton

    I have limited exposure to SAS and I’ve been digging deep into Python over last year. That being said, I like Python better for two reasons. One, the Python code is written in plain languague so what you think is probably how you write. Second, open source is for me. While I would have to get certifications and pay for training to actually get a SAS job, Python is open source with lots of learning resources therefore I can teach myself the skills I may need for the ‘next’ job without investing up front.