Data Science

Must Know “Python Libraries” For Data Science

A Practical Guide To Python Libraries

Vishnu Arun


Have you ever wondered why Python dominates all other programming languages in the field of data science and machine learning? Well, there are a plethora of contributing factors, some of them being:

  1. Welcoming Community: Python has a large community of skilled programmers who go out of their way to help newcomers get started, overcome difficulties, and shorten their learning curve!
  2. Easy to Understand: Python’s readability and the simplicity of its syntax make it particularly useful for performing operations on massive datasets and building complex machine learning algorithms without having to consciously focus on syntax.
  3. Versatility: Python empowers developers to perform a wide range of tasks, right from data mining and data analysis to training machine learning models and web deployment, making it one of the few languages capable of covering the entire workflow!
  4. Extensive Collection Of Libraries: Just think of any task you want to implement in a programming language, and there’s a high chance there’s already a Python library for it! Apart from making the implementation a lot easier, it also saves the developer a ton of time. Without exaggerating, it is fair to say that these libraries are at the heart of what binds the Python community together!

So without further ado, let us deep-dive into some of the most important Python libraries needed by a data scientist. Let us follow the typical workflow of a data science problem and look at the libraries used at each stage to make things even more interesting!


1) Data Collection

Every data science problem begins with collecting data centered on the problem statement. Although we will most likely be provided a dataset, it is always a good idea to build up your own data collection skills.

Python includes a number of packages that will assist us in scraping data from websites. It should be noted that many commercial companies do not allow data to be scraped from their websites. As a precaution, you may want to read their terms and conditions or risk being blocked from accessing the site.

1.1) Requests & Beautiful Soup

Using Requests and Beautiful Soup, users can scrape data off websites. However, there’s a drawback to using these libraries, i.e. we cannot scrape data off dynamic webpages.

Now, let’s take a look at an example:
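Here is a minimal sketch, using the practice site quotes.toscrape.com (a page built specifically for scraping exercises) as a stand-in target:

```python
import requests

# quotes.toscrape.com is a practice site meant for scraping exercises,
# used here as a stand-in for whatever page you want to collect data from
url = "https://quotes.toscrape.com"
response = requests.get(url)

# The status code tells us whether the request succeeded
print(response.status_code)
```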

If the above code returns a status code in the range 200-299, it means that the HTTP request was successful.
Read this article to get a picture of all the different HTTP status codes.

Let’s print the first 300 characters of the response we got:
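Continuing the sketch above, the raw HTML of the page lives in the response’s text attribute:

```python
# response.text holds the full HTML of the page as a string
print(response.text[:300])
```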

We can now use Beautiful Soup to extract specific tags/parts from the data that we scraped. For this example, let us extract specific <div> tags that give us an idea of the topics our quotes belong to:
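A minimal sketch, assuming the page structure of quotes.toscrape.com, where each quote’s topics sit inside a <div class="tags"> element:

```python
from bs4 import BeautifulSoup

# Parse the HTML we downloaded earlier with requests
soup = BeautifulSoup(response.text, "html.parser")

# Each quote's topics live inside a <div class="tags"> block on this page
tag_divs = soup.find_all("div", class_="tags")
for div in tag_divs[:3]:
    topics = [a.text for a in div.find_all("a", class_="tag")]
    print(topics)
```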

1.2) Selenium

Selenium is a library that addresses one of the primary shortcomings of the aforementioned libraries, i.e. the inability to scrape dynamic webpages.
It is also far more powerful, allowing the user to automate recurring browser operations such as clicks, scrolling, and form submissions.
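As a rough illustration, the sketch below (assuming a local Chrome installation and Selenium 4) drives a real browser, so content rendered by JavaScript becomes scrapeable:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real Chrome browser session (Selenium 4 manages the driver for us)
driver = webdriver.Chrome()

# The JavaScript-rendered version of the practice quotes site
driver.get("https://quotes.toscrape.com/js/")

# Because a real browser executed the JavaScript, the rendered elements are visible
tags = driver.find_elements(By.CLASS_NAME, "tag")
print([t.text for t in tags[:5]])

driver.quit()
```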

To get a complete grasp of using Selenium, I would recommend taking a look at this comprehensive tutorial by Jovian.

2) Data Preprocessing / Data Cleaning

The next stage after collecting data is to clean it. This involves steps like creating a data frame, understanding the dataset, imputing null values, finding and removing duplicate data, adding clean column headers, etc.
Numpy and Pandas are two popular libraries for getting this done.

2.1) Numpy

Numpy, aka “Numerical Python,” is a library that is primarily utilized by academics and researchers because of its extensive math capabilities and superior speed of operations. It supports operations on arrays, allows us to generate random values, work with n-dimensional arrays, etc.

Now let us look at an example of Numpy array operations:
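A quick sketch of a few common operations:

```python
import numpy as np

# Element-wise arithmetic on arrays
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)   # [11 22 33 44]
print(a * b)   # [10 40 90 160]

# Random values and n-dimensional arrays
matrix = np.random.rand(3, 3)       # 3x3 array of values in [0, 1)
print(matrix.shape, matrix.mean())
```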

  • Check this notebook and solve 100 Numpy questions to get a solid grasp of this Python library.
  • Do read this quick and informative post on Numpy Functions.

2.2) Pandas

Pandas, aka “Python Data Analysis,” is yet another widely used Python library, relied upon by everyone from beginners to experienced data scientists. It is a powerful tool that offers a variety of ways to load and work with labeled datasets, calculate various statistical parameters, create visualizations, etc.

Now let us look at an example of loading a CSV file into a Pandas data frame:
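A minimal sketch, assuming a hypothetical italy-covid-daywise.csv file on disk (any CSV works the same way):

```python
import pandas as pd

# Hypothetical file name; replace it with the path to your own CSV
df = pd.read_csv("italy-covid-daywise.csv")

print(df.shape)       # number of rows and columns
print(df.head())      # first five rows
print(df.describe())  # basic statistics for the numeric columns
```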

This dataset does not have any null/missing values, but here is how you would replace such values using the fillna method.
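A sketch of two common approaches, assuming hypothetical column names:

```python
# Replace missing values in a column with a fixed default
df["new_cases"] = df["new_cases"].fillna(0)

# Or fill them with a statistic such as the column mean
df["new_tests"] = df["new_tests"].fillna(df["new_tests"].mean())
```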

Apart from this, there are a number of other interesting methods that Pandas lets us perform. Do check out the official Pandas documentation here: Link

3) Data Analysis / Data Visualizations

The next step after cleaning the data is to gain insights: ask questions, find answers to them, and keep iterating over these steps until you feel you have gained enough insight to solve the problem statement. The tools that help most at this stage are data visualization libraries.
Here we will look at Seaborn and Plotly.

3.1) Seaborn

Seaborn is a Python library that helps create beautiful visualizations of datasets. It is built on top of matplotlib, which makes it very easy to learn and master. Let us look at an example by creating a pair-plot visualization of the “Italy Covid Dataset.”
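A minimal sketch, assuming the data frame and the hypothetical column names from the Pandas example above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pair-plot of a few numeric columns against each other
sns.pairplot(df[["new_cases", "new_deaths", "new_tests"]].dropna())
plt.show()
```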

Apart from this, we can also generate heatmaps that reveal the correlation between the columns of our dataset.
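For instance, a correlation heatmap over the numeric columns (continuing the sketch above):

```python
# Compute pairwise correlations between numeric columns and plot them
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```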

3.2) Plotly

Plotly is yet another interesting library that allows us to make interactive visualizations from the datasets using minimal code. Furthermore, these plotly charts can be embedded easily, making it one of the go-to tools used by corporate firms.
Let us look at an example of a parallel coordinates plot on a dataset used to predict “Surface Smoothness” based on various input parameters.
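A rough sketch with placeholder file and column names:

```python
import plotly.express as px
import pandas as pd

# Placeholder file and column names for a manufacturing-style dataset
df_smooth = pd.read_csv("surface-smoothness.csv")

fig = px.parallel_coordinates(
    df_smooth,
    color="surface_smoothness",  # the target we want to predict
    dimensions=["feed_rate", "spindle_speed", "depth_of_cut", "surface_smoothness"],
)
fig.show()
```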

  • I would personally recommend this free course “Zero To Pandas” by Jovian to improve your EDA skills.

4) Feature Engineering Using “Scikit-Learn”

After making visualizations and gaining insights from our dataset, the next step is to perform feature engineering on the dataset. There are a number of techniques that can be applied to the dataset based on the domain of the problem.

Scikit-Learn is hands down one of the most important tools when it comes to feature engineering, training, and deploying machine learning models.
Let us now see an example:
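Here is a minimal sketch of the idea, using a small hypothetical data frame with one numeric and one categorical column:

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pandas as pd

# Hypothetical data: one numeric feature and one categorical feature
data = pd.DataFrame({
    "feed_rate": [0.12, 0.20, 0.35, 0.28, 0.15],
    "material": ["steel", "aluminium", "brass", "steel", "brass"],
})

# fit_transform learns the column's mean and variance, then standardizes it
scaler = StandardScaler()
data["feed_rate"] = scaler.fit_transform(data[["feed_rate"]]).ravel()

# LabelEncoder maps the three categories to the integers 0, 1 and 2
encoder = LabelEncoder()
data["material"] = encoder.fit_transform(data["material"])

print(data)
```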
Note: Here the fit_transform does two things to the data.

  • It first calculates the mean and variance and saves it as internal objects.
  • Now using these calculated values, it applies transformation to our dataset.
  • The LabelEncoder then converts the categorical labels into integers, i.e. in this case all the values end up in the range [0, 2]. We could also use a One-Hot Encoder, which represents each category as its own binary column.

5) Training Machine Learning Model

5.1) Scikit-Learn

After performing the feature engineering, the next step is to train a model using an appropriate machine learning algorithm that suits the dataset.
Scikit-Learn supports a wide range of models, so choosing the right one for the dataset plays a crucial role in the result.
For this example, let us use Support Vector Machines to train our model.
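A minimal sketch using SVR, the support vector machine variant for regression (which matches the R2 evaluation below), on a synthetic stand-in dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic regression data standing in for our prepared features and target
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a support vector machine for regression
model = SVR(kernel="rbf")
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```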

Now let’s take a look at an evaluation metric, i.e., the R2 score, and make a visualization of our model’s output.
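Continuing from the snippet above:

```python
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# R2 score: 1.0 is a perfect fit, values near 0 mean the model explains little
print("R2 score:", r2_score(y_test, predictions))

# Predicted vs. actual values; points close to the diagonal indicate a good fit
plt.scatter(y_test, predictions)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.show()
```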

5.2) Pytorch / Keras

While Scikit-Learn supports a lot of general-purpose machine learning operations and algorithms, it has its drawbacks when it comes to deep learning. You can choose libraries like PyTorch or Keras to train your deep learning models, based on your personal preference.
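For a flavour of what PyTorch code looks like, here is a purely illustrative sketch of a tiny fully connected network (the layer sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    """A tiny fully connected network, for illustration only."""

    def __init__(self, in_features=4, hidden=16, out_features=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x):
        return self.net(x)

model = SimpleNet()
sample = torch.randn(8, 4)   # a batch of 8 examples with 4 features each
print(model(sample).shape)   # torch.Size([8, 1])
```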
Here is a list of free tutorials that I would recommend to gain proficiency in this domain:
1. Deep Learning with PyTorch: Zero to GANs (Link)
2. Courses by DeepLearning.Ai (Link)

6) Model Deployment

There are a ton of tools and libraries out there to get your ML model deployed. For real-world production-ready models, business firms tend to use tools like AWS Sagemaker, TensorFlow Serving, etc.
Apart from the above mentioned, some other deployment tools include Django, Flask, and Streamlit. Moreover, a trained Scikit-Learn model can be wrapped in a lightweight Flask app to deploy it quickly.

6.1) Streamlit

Streamlit is a fairly new tool, but it is one of the simplest ways to deploy your model, as it requires significantly less code and almost no web programming experience.
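As a minimal illustrative sketch (the file name, widgets, and the missing predict call are placeholders for your own model):

```python
# app.py -- run with: streamlit run app.py
import streamlit as st
import pandas as pd

st.title("Simple Model Demo")

# Let the user upload a CSV of input features
uploaded = st.file_uploader("Upload a CSV of input features", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write("Preview of your data:", df.head())
    # model.predict(df) would go here once a trained model is loaded
    st.success("Data loaded - plug in your model's predict call above.")
```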
Check out this practical walkthrough by Gurami Keretchashvili about deploying your model using Streamlit.

Conclusion

In this post, we’ve discussed all the primary steps involved in working with a data science problem and the respective libraries that help us at each stage.
Here’s a summary of all the libraries that have been discussed in this post.
· Data Collection
1. Requests & Beautiful Soup
2. Selenium
· Data Preprocessing / Data Cleaning
1. Numpy
2. Pandas
· Data Analysis / Data Visualizations
1. Seaborn
2. Plotly
· Feature Engineering
1. Scikit-Learn
· Training Machine Learning Model
1. Scikit-Learn
2. Pytorch / Keras
· Model Deployment
1. Streamlit

