My Foray into the underworld of Data Science – Data Scientist Journey
Data Scientist Journey – here’s my journey into the underworld of Data Science. Experienced data scientists reading this will be like “Of course, yes!” Newbies will be like “Say whaaat??” But I just say, everybody chill, it’s just me ranting…
You hear about the glamor, the money, the “hotness” as the press would say, the high salaries, driven by the rising demand for the unicorn data scientist.
Job descriptions are as varied as the tools data scientists use. From specifics like Python, R, Spark, Storm, Hadoop, NumPy, SciPy, Scikit-Learn and Flask, to generalized skills like machine learning and distributed computing, the list is endless.
A few months ago, when I decided to switch fields and move into data science, I took a couple of hours to read blog posts, spent time digging around for information on Quora, and made a list of the tools I needed to learn:
Top Skills for Data Science/Analytics
- Domain Expertise (e-commerce, network security, healthcare, etc.)
- Programming Skills (Python, R, SAS, Mathematica, etc.)
- Machine Learning (K-Nearest Neighbors, Random Forest, Support Vector Machines, etc.)
- Data Handling (Data Mining, Data Munging, Data Cleaning, etc.)
- Statistical Knowledge (Optimization, Hypothesis Testing, Summary Statistics, etc.)
- Data Management (SQL, NoSQL, MongoDB, XML, JSON, Excel)
- One or more Big Data platform tools (Hadoop, Hive, Spark & Pig)
- Visualization Skills (Matplotlib, D3, Seaborn, Bokeh, Plotly)
Of course, I knew I only needed to know at least one tool from each category, but even that is not an easy task.
There have been three study materials that have helped me tremendously:
- Andrew Ng’s Machine Learning course on Coursera
- The book – Python for Data Analysis, by Wes McKinney
- Harvard CS 109 online classes, videos, homework and lab sessions
Visit my Project Portfolio on Heroku
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it” — Dan Ariely
My First Project – Telecoms Network Analytics
Within a couple of weeks I started churning out code and posting it on GitHub. For my first comprehensive project, I analyzed the US long-haul fiber optic infrastructure and its connectivity: the router-level topology of switches, routers and base stations, and their installation in different geographic locations.
- 20 Service Providers (Verizon, AT&T, Sprint, Tata, CenturyLink, Cogent, etc)
- 273 Nodes (routers, switches, cell towers/base stations, etc.)
- 542 Fiber links (edges connecting the nodes)
Using graph theory techniques, I modeled the dataset as a collection of nodes and edges. The goal was to examine the dataset, visualize the distribution of network infrastructure within the US, visualize source-sink relationships between states and cities, analyze inbound/outbound connections, and determine how population or other parameters may influence network resource utilization.
For example, what characteristics do we see in telecoms infrastructure deployed in San Francisco compared to Utah? Are there some states or cities that can share network resources if a nearby city’s resource is over-stretched?
Can telecoms networks share resources effectively? Can we predict optimal resource sharing for telecom operators like AT&T, Sprint, etc.?
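To make the graph-theory bit concrete, here’s a minimal sketch (not my actual project code) of how such a network can be modeled with NetworkX. The node names, cities and provider tags below are made-up placeholders, not rows from the real dataset:

```python
# A toy graph of fiber infrastructure: nodes are equipment,
# edges are fiber links, attributes carry city and provider.
import networkx as nx

G = nx.Graph()

# Nodes are routers/switches/base stations, tagged with their city
G.add_node("SF-router-1", city="San Francisco", kind="router")
G.add_node("SLC-switch-1", city="Salt Lake City", kind="switch")
G.add_node("LA-bts-1", city="Los Angeles", kind="base station")

# Edges are fiber links between nodes, tagged with the provider
G.add_edge("SF-router-1", "SLC-switch-1", provider="CenturyLink")
G.add_edge("SF-router-1", "LA-bts-1", provider="AT&T")

# Node degree hints at how connected (or over-stretched) a site may be
degrees = dict(G.degree())
busiest = max(degrees, key=degrees.get)
print(busiest, degrees[busiest])  # SF-router-1 2
```

From a graph like this you can start asking the resource-sharing questions above: which nodes carry the most links, and which nearby cities could absorb overflow.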
I packaged the solution into a Python web application using Flask and Bokeh and published it on Heroku as a telecoms network analysis app.
In contrast with my second project, which I’ll talk about in a moment, one simple lesson I learnt in this first project is that data analysis is fun and easy when your dataset plays “nice”.
If you’re working with the MNIST or Iris dataset, you are living in heaven. If you’ve not tried your hands on dirty, messy, clunky data, do not flex your muscles yet, else you’ll be living in a fool’s paradise!
My second project showed me what most well-versed data scientists already recognize: analytics is the fun part of your project, but it’s only about 15% of your actual workload.
A large chunk of a data scientist’s time will be spent on data gathering, preparation and understanding the data. I discovered this in my second project.
My Second Project – Network IP Intelligence
My first project was fun but this one right here is kinda brutal!
What made it brutal was the dataset. It is one messed-up, large dataset: 4.5GB of data consisting of 44 million rows, each containing only 4 columns: Date, MD5Hash, Domain, IP Address. Let me give you an idea of what the data means.
We know that banks, insurance and credit card companies are leaning heavily towards online security and fraud detection using data science and analytics tools.
An IP address is one of the most effective means of identification because of the underlying information it contains. From an IP address, we can determine several geolocation details about the user and check whether the IP has already been blacklisted.
Because IP intelligence reveals the IP address and geographic location of website traffic, we can:
- Enable targeted online advertising
- Prevent fraudulent transactions
- Present content on location basis (e.g. forms, agreements, contracts, TOS, etc)
- Provide website analytics in real time
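As a tiny illustration of one such check, here’s a sketch using Python’s standard ipaddress module to test whether an address falls inside a blacklisted network block. The blacklist entries are made up for illustration (they’re from the reserved documentation ranges), not a real threat feed:

```python
# Check an IP against a list of blacklisted network blocks.
import ipaddress

blacklist = [
    ipaddress.ip_network("203.0.113.0/24"),   # stand-in for a real feed entry
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blacklisted(ip_str):
    """Return True if the address falls inside any blacklisted block."""
    ip = ipaddress.ip_address(ip_str)
    return any(ip in net for net in blacklist)

print(is_blacklisted("203.0.113.42"))  # True
print(is_blacklisted("192.0.2.1"))     # False
```

A real fraud-detection pipeline would pull the blocks from a threat-intelligence feed, but the membership test is this simple.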
The dataset contains 36 months of passive Domain Name System (DNS) data produced by executing suspect (malicious) Windows executables for a short time. On each execution, the interaction with the DNS is recorded into the 4 columns of Date, MD5Hash, Domain and IP Address, which form our 44 million rows of data.
Can you see the problem here? All our data, except the date, are strings! Some are as long as 255 characters, while several other rows are missing either the Domain or the IP Address.
Unlike the Iris, MNIST or other famous datasets, this IP problem is interesting but the data isn’t. In fact, looking at the data is just plain depressing. Read the project description here »
Repeat after me: if you have categorical [string] data you’re pretty much screwed!
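To show what I mean, here’s a sketch of the first-pass cleanup on a fabricated 4-column sample shaped like the real data. The rows below are invented; the real file has 44 million of them:

```python
# First-pass cleanup: parse dates, drop rows that can't be enriched.
import io
import pandas as pd

raw = io.StringIO(
    "Date,MD5Hash,Domain,IP\n"
    "2014-03-01,9e107d9d372bb6826bd81d3542a419d6,bad-domain.example,203.0.113.7\n"
    "2014-03-02,e4d909c290d0fb1ca068ffaddf22cbd0,,203.0.113.8\n"    # missing Domain
    "2014-03-03,a3cca2b2aa1e3b5b3b5aad99a8529074,other.example,\n"  # missing IP
)

df = pd.read_csv(raw, parse_dates=["Date"])

# Rows missing either the Domain or the IP are useless for lookups
df = df.dropna(subset=["Domain", "IP"])
print(len(df))  # 1 row survives
```

Everything except Date comes back as a plain string, which is exactly the problem: before any analysis you’re parsing, dropping and normalizing.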
I had to do something. I knew the most important column here was the IP address.
From my experience in web development, management and maintenance, I knew about whois queries for IP addresses, which can give me more information like hostname, city, country, company, ISP, latitude & longitude, and user-agent information about the IP.
But I’ve got 44 million rows of data!
It seemed I needed a geolocation API that could take IP addresses as input and spew out as much information as possible about each IP.
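Here’s the shape of what I was hoping for, sketched against a canned JSON response. The endpoint and field names are made up to illustrate the idea; every real geolocation service returns something slightly different:

```python
# Turning a (pretend) geolocation API response into a flat record.
import json

# Pretend this came back from GET https://api.example.com/ip/203.0.113.7
canned_response = json.dumps({
    "ip": "203.0.113.7",
    "city": "Anytown",
    "country": "US",
    "isp": "ExampleNet",
    "lat": 37.77,
    "lon": -122.41,
})

record = json.loads(canned_response)
row = (record["ip"], record["city"], record["country"], record["isp"])
print(row)  # ('203.0.113.7', 'Anytown', 'US', 'ExampleNet')
```

Multiply that by 44 million lookups and the flat rows become new columns alongside the original four.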
I was pumped!
Hello, Data Scraping!
Uh oh, but here come the kickers…
- Most API services have daily limits of 1,000 queries, with the most generous allowing 10,000, after which you’re blocked or errors 403 and 404 start knocking on your responses.
- They’re paid services averaging $35 monthly for 10,000 queries. Quick mental calculation: 44 million needed queries / (10,000 daily limit × 30 days) => about 146 months before I can gather enough data for analysis! WTF!
- If I use multiple services, I can shorten the duration, but then the API responses will differ and I may have to write a script to merge that data. I’d also have to split the dataset into different .csv files in a format each API service can accept. Oh Lawrd, this is getting twisted!
- I’d have to spend some money on APIs anyway.
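For the curious, the quota arithmetic above, plus the batch-splitting idea, looks roughly like this (the numbers mirror the post; the batching helper is illustrative, not my actual cron job):

```python
# Why one free API tier can't cover 44 million lookups.
ROWS = 44_000_000
DAILY_LIMIT = 10_000        # the most generous free tier
DAYS_PER_MONTH = 30

months_needed = ROWS / (DAILY_LIMIT * DAYS_PER_MONTH)
print(round(months_needed, 1))  # 146.7 months on a single service

# Splitting row indices into daily-limit-sized batches for a scheduled job
def batches(n_rows, batch_size):
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

first_three = list(batches(ROWS, DAILY_LIMIT))[:3]
print(first_three)  # [(0, 10000), (10000, 20000), (20000, 30000)]
```

Splitting the work across several services divides that 146 months down, at the cost of merging mismatched response formats.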
I ended up using 3 free and paid API services, wrote a cron job script that automates the data fetching process, formats the returned JSON as a pandas dataframe, and saves it in batches of .csv files. I also wrote a cron job that automatically pushes my updates to GitHub.
This data gathering and preparation process took two weeks!
I ain’t started data analysis yet, my friend. We’re just sourcing, searching and scraping for data.
I’ve not even done any data cleaning yet, nor do I know what kind of analysis I can do with the returned data. How to prepare the data for exploratory data analysis (EDA) is still not clear.
Oh wait, are you seriously asking me about machine learning for predictive analysis?
Please hold your horses, I ain’t thinking of that yet till I’ve figured out what my data says through detailed EDA.
At the time of writing this post, I’m in the fourth week of this project.
Those 44 million rows have now come down to about 20 million rows of good, clean, useable data of about 2GB, with about 17 columns of data attributes.
Keep an eye on my project portfolio