PythonPro #4: Analyze Data 170,000x Faster and How Python Lowers the Entry Barrier for Data Scientists
Bite-sized actionable content, practical tutorials, and resources for Python programmers
"The more comfortable a programming language is to learn, the lower the entry barrier is for more math and stats-oriented people."
– Sebastian Raschka (2018), in an interview published in the book Python Interviews
Welcome to this week’s issue of Python Pro!
Python's accessibility and readability are key factors in its popularity for AI and machine learning. This, says Sebastian Raschka, a prominent figure in Python for AI and machine learning, is what makes the language the right choice for him and many other data scientists. Beyond this, his deep involvement in open source projects like scikit-learn, and his own data science tools, mlxtend and BioPandas, highlights his dedication to advancing data science and ML through Python. Read the complete interview in the Expert Insight section of today’s issue.
And, from around the web, we have a fresh roundup of useful reading and learning resources.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: If you want us to find a tutorial for you the next time or give us any feedback, leave your comments at the end.
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Patching the libwebp vulnerability across the Python ecosystem: In this article, Seth Larson explains how the Python ecosystem addressed the critical libwebp vulnerability (CVE-2023-4863). Read for insights into the process of identifying and patching vulnerabilities in open source software components within the Python ecosystem.
Why is the Django Admin “Ugly”?: It turns out it's by design: co-creator Jacob Kaplan-Moss explained it was originally created for the staff of a weekly newspaper. Read if you're considering exposing it to customers, and check out WagtailModelAdmin for improved accessibility and a better user experience.
dj-notebook: the REPL I’ve Always Wanted for Django: The REPL seamlessly combines Jupyter notebooks with Django and makes development smoother. Read to learn how you can add it to your workflow, create a Jupyter notebook, and experience efficient iteration with PyCharm without Jupyter setup headaches.
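If you want a taste before reading, here is a minimal sketch of dj-notebook's documented activate() pattern; the project, app, and model names below are placeholders for your own:

```python
# Inside a Jupyter notebook at your Django project root
from dj_notebook import activate

plus = activate("myproject.settings")  # path to your settings module (placeholder)

# The Django ORM now works directly in the notebook
from myapp.models import Article  # placeholder app and model

print(Article.objects.count())
```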
💼Case Studies and Experiments🔬
Sentry, a Developer’s Partner, Interview with Co-Founder David Cramer: In this audio interview, David Cramer talks about how Sentry's developer-first approach contributed to its growth and its synergy with Python. Listen for product development insights and learn about plans to push the Python ecosystem forward and make it more functional for modern UI frameworks.
Support Ticket Classification using TF-IDF Vectorization: In this useful case study, the author explains the TF-IDF process, vectorization, and the curse of high dimensionality. Read for instructions on deploying the best model for classifying new support tickets and understand the importance of converting text data into numeric form for classification.
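As a quick illustration of the technique (not the case study's actual data or model), here is a minimal scikit-learn sketch that vectorizes toy tickets with TF-IDF and trains a linear classifier on top:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy tickets and labels, purely illustrative
tickets = [
    "Cannot log in to my account",
    "Payment failed at checkout",
    "App crashes on startup",
    "Refund not received for my order",
]
labels = ["account", "billing", "bug", "billing"]

# TF-IDF converts each ticket into a sparse numeric vector;
# the linear model then learns per-category weights over those features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tickets, labels)

print(model.predict(["my payment did not go through"]))
```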
📊Analysis
Novice to Ninja: Why Your Python Skills Matter in Data Science: This article compares and analyzes code written by novice to ninja Python programmers. Read to understand the benefits of coding proficiency, like efficiency, readability, reusability, robustness, skill recognition, adaptability, and cost-effectiveness while using Python for data science.
Unveiling the future of data extraction using Python and AI for video-based information recognition: This article delves into the pivotal role Python plays in automating video data extraction. Read for insights into techniques and methodologies and to explore case studies in healthcare, e-commerce, and surveillance, showcasing how Python and AI enhance real-time product recognition, automated identification of suspicious activities, and more.
🎓 Tutorials and Guides 🤓
How to migrate Python functions from AWS Lambda to OpenFaaS and Kubernetes: This guide covers the entire migration process, from setting up Linode Object Storage to configuring function dependencies and refactoring the function handler. Read to learn how to migrate a real-world ETL Python function, including configuring S3 clients and handling event triggers.
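For flavor, here is a hedged sketch of two pieces the guide touches on: a boto3 client pointed at an S3-compatible endpoint, and an OpenFaaS python3-http-style handler. The endpoint, bucket, key, and environment variable names are placeholders:

```python
import os

import boto3

# S3-compatible client for Linode Object Storage (illustrative endpoint)
s3 = boto3.client(
    "s3",
    endpoint_url="https://us-east-1.linodeobjects.com",
    aws_access_key_id=os.environ["ACCESS_KEY"],
    aws_secret_access_key=os.environ["SECRET_KEY"],
)

def handle(event, context):
    # OpenFaaS python3-http template signature; the AWS Lambda
    # equivalent is `def handler(event, context)` with an AWS event dict.
    body = s3.get_object(Bucket="etl-input", Key="data.csv")["Body"].read()
    return {"statusCode": 200, "body": f"read {len(body)} bytes"}
```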
How to Use Type Hints for Multiple Return Types in Python: When working with functions that may return different types of results, you can specify multiple return types using the pipe operator (|) in Python 3.10 or newer. Read to learn how to use type hints effectively, especially for conditional statements, optional values, error handling, and higher-order functions.
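A minimal sketch of the syntax in question:

```python
def parse_id(value: str) -> int | str:
    # Returns an int for numeric input, otherwise the original string (3.10+)
    return int(value) if value.isdigit() else value

from typing import Optional, Union

def parse_id_legacy(value: str) -> Union[int, str]:  # pre-3.10 spelling
    return int(value) if value.isdigit() else value

def find_user(user_id: int) -> Optional[str]:  # same as `str | None`
    users = {1: "alice"}
    return users.get(user_id)  # None when the user is missing
```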
Unlocking the Power of Delta Live Tables for Streamlined Data Pipelines: This article dives deep into the Delta Live Tables Python interface, covering key aspects like limitations, module import, table and view creation, and more. Read for essential guidance and examples for building efficient pipelines.
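To ground that, here is a minimal sketch of the documented @dlt.table/@dlt.view pattern; the table names, storage path, and amount column are placeholders, and the spark session is provided by the pipeline runtime:

```python
import dlt  # available inside a Databricks Delta Live Tables pipeline
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested from cloud storage")
def raw_orders():
    return spark.read.format("json").load("/data/orders/")  # placeholder path

@dlt.view(comment="Orders above an arbitrary threshold")
def large_orders():
    return dlt.read("raw_orders").where(col("amount") > 100)
```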
Combine year and month columns in Pandas: This is a critical skill for time-based insights and trend analysis. Read to learn how you can achieve this by going through four essential steps, including loading the Pandas library, checking your data, merging columns, and optionally saving your modified dataset.
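Here is one way those steps look in practice, on toy data; pd.to_datetime accepts a DataFrame with year/month/day columns:

```python
import pandas as pd

# Toy data with separate year and month columns
df = pd.DataFrame({"year": [2023, 2023, 2024], "month": [1, 11, 2], "sales": [10, 20, 30]})

# Merge into a proper datetime (day fixed to 1)
df["period"] = pd.to_datetime(df[["year", "month"]].assign(day=1))

# Or as a simple "YYYY-MM" string
df["year_month"] = df["year"].astype(str) + "-" + df["month"].astype(str).str.zfill(2)

df.to_csv("sales_with_period.csv", index=False)  # optionally save the result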
Generate All Subarrays of an Array in Python: This tutorial explores three approaches including nested loops, list comprehension, and recursion for efficient subarray generation. Read to learn which approach to choose when dealing with varying levels of complexity and larger arrays.
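For reference, here are the first two approaches in miniature (the recursive variant follows the same i/j slicing idea):

```python
def subarrays_loops(arr):
    # Nested loops: O(n^2) subarrays, each slice costing O(n) to copy
    result = []
    for i in range(len(arr)):
        for j in range(i + 1, len(arr) + 1):
            result.append(arr[i:j])
    return result

def subarrays_comprehension(arr):
    # The same idea, as a list comprehension
    n = len(arr)
    return [arr[i:j] for i in range(n) for j in range(i + 1, n + 1)]

print(subarrays_comprehension([1, 2, 3]))
# [[1], [1, 2], [1, 2, 3], [2], [2, 3], [3]]
```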
🔑 Best Practices and Code Optimization 🔏
`eval` should not be a built-in function: The author of this article proposes reducing eval misuse by making it less discoverable to promote safer coding practices. Read to learn why the use of eval requires caution and how related security and code complexity issues can be avoided.
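One common mitigation (our illustration, not necessarily the author's proposal) is to reach for ast.literal_eval when you only need to parse literals:

```python
import ast

user_input = "[1, 2, 3]"

# Dangerous: eval executes arbitrary expressions, e.g.
# "__import__('os').system('...')" on untrusted input
value = eval(user_input)

# Safer: ast.literal_eval accepts only Python literals
# and raises ValueError for anything else
value = ast.literal_eval(user_input)
```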
Analyzing Data 170,000x Faster with Python: This article takes you through 13 step-by-step optimizations, primarily using Numba to accelerate key functions and data structures in the code. Read to learn how to achieve a 170,000x speedup through techniques like bitsets, Numba compilation, and custom operations.
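The article's benchmark is too long to reproduce here, but the core pattern it builds on is Numba's @njit decorator, sketched below on a toy loop:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def count_matches(a, b):
    # A tight loop like this is slow in pure Python but fast once compiled
    total = 0
    for i in range(a.shape[0]):
        if a[i] == b[i]:
            total += 1
    return total

a = np.random.randint(0, 2, 1_000_000)
b = np.random.randint(0, 2, 1_000_000)
print(count_matches(a, b))
```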
Python is a Compiled Language: This article highlights the importance of understanding the dynamic vs. static aspects of a programming language. Read to understand why the author challenges the conventional view of Python as an interpreted language by dissecting its compilation stages and error reporting.
Linear Search in Python: This article explains how Linear Search works, even with unsorted data, and provides a step-by-step example. Read if you want to learn how to implement Linear Search in Python and understand its time and space complexity.
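The whole algorithm fits in a few lines:

```python
def linear_search(items, target):
    # Scan each element in turn; no sorting required.
    # Time complexity: O(n); space complexity: O(1).
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1  # not found

print(linear_search([42, 7, 19, 3], 19))  # 2
print(linear_search([42, 7, 19, 3], 99))  # -1
```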
Python grequests: Making Asynchronous HTTP Requests: This article explains how to use grequests, a library for making asynchronous HTTP requests, to improve application efficiency and responsiveness by sending multiple requests in parallel. Read to learn about its limitations and how to enhance performance using the imap method.
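A minimal sketch of the imap pattern the article covers (the URLs are placeholders):

```python
import grequests  # gevent-based; import before `requests` to avoid monkey-patching issues

urls = [f"https://httpbin.org/delay/1?i={i}" for i in range(5)]
reqs = (grequests.get(u) for u in urls)

# imap yields responses as they complete rather than waiting for all of them;
# `size` caps how many requests are in flight at once.
for response in grequests.imap(reqs, size=3):
    print(response.status_code, response.url)
```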
🧠 Expert insight 📚
Here’s an exclusive excerpt from Python Interviews by Michael Driscoll, a treasure trove of a book made up of 20 one-to-one interviews with leading Python programmers and luminaries in the field.
Chapter 12: Sebastian Raschka
Sebastian Raschka received his doctorate in Quantitative Biology and Biochemistry and Molecular Biology from Michigan State University in 2017. His research activities included the development of new deep learning architectures to solve problems in the field of biometrics. Sebastian is the bestselling author of Python Machine Learning, the first edition of which received the ACM Computing Reviews’ Best of 2016 award.
He contributes to many open source projects including scikit-learn. Methods that Sebastian implemented are being used in real-world machine learning applications such as Kaggle. He is passionate about helping people to develop data-driven solutions.
Discussion themes: Python for AI/machine learning, v2.7/v3.x.
Q: Could you give a little background information about yourself?
A: Of course! My name probably already gives it away, but I was born and raised in Germany, where I lived for more than two decades, until I had the urge to go on an adventure and study in the US.
I received my undergraduate degree from Heinrich-Heine University in Düsseldorf. I remember one day walking to the cafeteria and stumbling upon a flyer regarding a study abroad program with Michigan State University (MSU). I was super intrigued and thought that this might be a worthwhile experience. So not long after that, I studied for two years at MSU and received a Bachelor Plus/International degree.
During those two semesters, I made many friends at MSU and thought that the scientific environment would provide an excellent opportunity for me to grow as a scientist, which is why I applied for grad school at MSU. I should say that this chapter of my life came with a happy ending, as I obtained my Ph.D. in December 2017. So that's my academic career.
During my time as a graduate student, I got heavily involved in open source in the context of data science and machine learning. Also, I am a passionate blogger and writer. Some people may have stumbled upon my book, Python Machine Learning, which was very well-received by both people from academia and the industry.
With my book, I tried to bridge the gap between purely practical (that is, coding) books and purely theoretical (i.e., math-heavy) works. Based on all of the feedback that I received, Python Machine Learning turned out to be super useful to a broad audience. The book was translated into seven languages and is currently used as a textbook at Loyola University Chicago, the University of Oxford, and many other institutions.
Q: Do you contribute to any open source projects?
A: Yes, besides my writings, I am contributing to open source projects such as scikit-learn, TensorFlow and PyTorch. I also have my own little open source projects that I work on in my free time, including mlxtend and BioPandas.
mlxtend is a Python library with useful tools for day-to-day data science tasks. It aims to fill gaps in the Python data science ecosystem, by providing tools that are not yet available in other packages. For example, the stacking classifiers and regressors, as well as the sequential feature selection algorithms, are very popular in the Kaggle community.
In addition, the frequent pattern mining algorithms, including Apriori and algorithms for deriving association rules, are super handy. Most recently, I added a lot of non-parametric functions for evaluating machine learning classifiers, from bootstrapping to McNemar's tests.
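For readers who haven't used mlxtend, here is a minimal sketch of the frequent pattern mining tools mentioned above, using toy one-hot encoded transactions (not an example from the interview):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot encoded transactions (each row is a basket)
transactions = pd.DataFrame(
    {"bread": [1, 1, 0, 1], "butter": [1, 1, 0, 0], "jam": [0, 1, 1, 1]},
    dtype=bool,
)

frequent = apriori(transactions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```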
The BioPandas project arose from the need to work with molecular structures from different file formats more conveniently. During my Ph.D., many projects involved working with protein structures, or structures of small (drug-like) molecules. There are many tools out there for that, but each has its own little sublanguage. To stay most productive, I didn't want to learn a whole new API for each little side project.
The idea behind BioPandas is to parse structural files into pandas DataFrames, a library and format that most data scientists are already familiar with. Once the structures are in a DataFrame format, we can use all of the power of pandas that is at our disposal, including its super flexible selection syntax.
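The pattern he describes looks roughly like this in practice; the PDB ID below is an arbitrary public structure, borrowed from the BioPandas documentation:

```python
from biopandas.pdb import PandasPdb

# Fetch a protein structure from the Protein Data Bank
ppdb = PandasPdb().fetch_pdb("3eiy")

# Atom records are now an ordinary pandas DataFrame,
# so pandas' flexible selection syntax applies directly
atoms = ppdb.df["ATOM"]
ca_atoms = atoms[atoms["atom_name"] == "CA"]  # alpha-carbon atoms
print(ca_atoms[["residue_name", "residue_number", "x_coord", "y_coord", "z_coord"]].head())
```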
A virtual screening tool that I recently developed, screenlamp, makes heavy use of BioPandas as its core engine. I could screen databases with more than 12 million molecules efficiently, which led to the successful discovery of potent G protein-coupled receptor signaling inhibitors, with applications to aquatic invasive species control, in collaboration with experimental biologists at MSU.
Besides all of my involvement in computational biology, one of my other passion projects involves semi-adversarial networks. Semi-adversarial networks are a deep learning architecture that I developed with my collaborators in the iPRoBe Lab at MSU, which we successfully applied in the context of privacy concerns in the field of biometrics.
In particular, we applied this architecture to perturb face images in such a way that they looked almost identical to the original input images, while soft biometric attributes, such as gender, were inaccessible by gender predictors. The overall goal is to prevent nasty things like profiling, based on soft biometric attributes, without a user's consent.
Q: So why did you become a programmer?
A: I would say that the primary driving factor for becoming a programmer was to be able to implement my 'crazy' research ideas.
In computational biology, we already have many tools at our disposal that we can use without the need to program ourselves. However, using existing tools (depending on the research task) can also be a bit limiting. If we want to try something new, especially if we want to develop new methods, then there is no way around learning how to program.
Like most people, I started with simple Bash scripting in a Linux shell. At some point, I realized that this wasn't quite enough, or not efficient enough. During my undergraduate studies in Germany, I took a bioinformatics class in Perl.
When I saw what was possible with Perl, this was quite an eye-opening experience. Later, when I was conducting statistical analyses and preparing data visualizations based on the data that I collected, I also got into R. Not long after that, I got into Python.
Q: Why Python?
A: Well, I mentioned that I started with Perl and R. However, one thing that most programmers have in common is that we consult the internet on a regular basis to look for useful pointers, and other tips and tricks for achieving certain subtasks.
Suffice it to say, I stumbled upon many different resources that were written in Python and I thought that it would be worthwhile learning this language. At some point, I moved away from Perl entirely and did all of my coding in Python: custom scripts for data collection, parsing and analysis.
I also have to mention that I did all of the statistical analyses and plotting in R. Actually, not too long ago, when I was revisiting an old project, I stumbled upon my old Frankenstein-esque scripts (Bash scripts and makefiles), which were running Python and R in tandem.
Now, back in 2012, when the scientific computing stack was growing quickly, I stumbled upon NumPy, SciPy, matplotlib and scikit-learn. I realized that everything that I did in R, I could also do in Python. I could avoid switching back and forth between languages in my projects.
Looking back, picking up Python was probably one of the best decisions that I made. Without Python, it wouldn't have been possible for me to be so productive. But besides research and work, I really enjoy being part of and interacting with the vivid Python community. Whether I am interacting with people via Twitter, or meeting people at conferences like PyData and SciPy, it's always a fun experience.
Q: Python is one of the languages that is being used in AI and machine learning right now. Could you explain what makes it so popular?
A: I think there are two main reasons, which are very related. The first reason is that Python is super easy to read and learn.
I would argue that most people working in machine learning and AI want to focus on trying out their ideas in the most convenient way possible. The focus is on research and applications, and programming is just a tool to get you there. The more comfortable a programming language is to learn, the lower the entry barrier is for more math and stats-oriented people.
Python is also super readable, which helps with keeping up-to-date with the status quo in machine learning and AI, for example, when reading through code implementations of algorithms and ideas. Trying new ideas in AI and machine learning often requires implementing relatively sophisticated algorithms and the more transparent the language, the easier it is to debug.
The second main reason is that while Python is a very accessible language itself, we have a lot of great libraries on top of it that make our work easier. Nobody would like to spend their time on reimplementing basic algorithms from scratch (except in the context of studying machine learning and AI). The large number of Python libraries which exist help us to focus on more exciting things than reinventing the wheel.
By the way, Python is also an excellent wrapper language for working with more efficient C/C++ implementations of algorithms and CUDA/cuDNN, which is why existing machine learning and deep learning libraries run efficiently in Python. This is also super important for working in the fields of machine learning and AI.
To summarize, I would say that Python is a great language that lets researchers and practitioners focus on machine learning and AI and provides less of a distraction than other languages.
Q: Were there any moments where things may have gone another way, but serendipitously ended up the way that they did?
A: That's a good question. Maybe the fact that Python was popular among the Linux community, but worked very well on Windows as well. This was likely a big contributor to Python becoming so popular today.
There are relatively similar languages out there like Ruby. The Ruby on Rails project was (and still is) super popular. If projects like Django hadn't started, Python might have become less popular as an all-rounder, which may have led to fewer resources and open source contributions being devoted to developing Python. In turn, Python may have been less popular as a language for machine learning and AI.
If Travis Oliphant hadn't started the NumPy project (it was called Numeric back then in 1995), I think fewer scientists would have picked up Python as a scientific programming language early in their careers. We would all still be using MATLAB.
Q: So is Python just the right tool at the right time, or is there another reason that it's become so important in AI and machine learning?
A: I think that's a bit of a chicken or the egg problem.
To untangle it, I would say that Python is convenient to use, which led to its wide adoption. The community has developed many useful packages in the context of scientific computing. Many machine learning and AI developers prefer Python as a general programming language for scientific computing, and they have developed libraries on top of it, like Theano, MXNet, TensorFlow and PyTorch.
On an interesting side note, having been active in the machine learning and deep learning communities, there was one thing that I heard very often: "The Torch library is awesome, but it is written in Lua, and I don't want to spend my time learning yet another language." Note that we have PyTorch now.
Q: Do you think this opens the door for any Python programmer to start experimenting with AI?
A: I do think so! It depends on how we interpret AI, but regarding deep learning and reinforcement learning, there are many convenient packages with Python wrappers out there.
Probably the most popular example at the moment would be TensorFlow. Personally, I use both TensorFlow and PyTorch in my current research projects. I have been using TensorFlow since it was released in 2015 and like it overall. However, it is a bit less flexible when trying out unusual research ideas, which is why I recently got more into PyTorch. PyTorch itself is more flexible and its syntax is closer to Python; in fact, PyTorch describes itself as "a deep learning framework that puts Python first."
Q: What could be done to make Python a better language for AI and machine learning?
A: While Python is a language that is very convenient to use and nicely interfaces with C/C++ code, we have to keep in mind that it is not the most efficient language.
Computational efficiency is why C/C++ is still the programming language of choice for several machine learning and AI developers. Also, Python is not supported on most mobile and embedded devices. Here we have to distinguish between research, development and production.
The convenience of Python comes at a price, which is performance. On the other hand, speed and computational efficiency come with a trade-off in terms of productivity. In practice, I think that it's usually best to split tasks when working in a team, for instance, having people who specialize in research and trying new ideas, and people who specialize in taking prototypes to production.
I am mainly a researcher and haven't run into this problem yet, but I have also heard that Python is not good for production. I think this is mainly due to existing infrastructure, however, and the tools that are supported by the servers, so it's not really Python's fault per se.
In general, due to its nature as a high-level and general-purpose programming language, Python doesn't scale as well as other languages such as Java or C++, although they are more tedious to use. For instance, spending too much time in the Python runtime, when working with TensorFlow, can be a real performance killer. Improving the general efficiency of Python (I don't think this is really possible though while keeping Python as convenient as it is) would be beneficial to AI and machine learning.
While Python provides a great environment for rapid prototyping, it is sometimes a little bit too forgiving and dynamic types allow you to make mistakes more easily. I think the recent introduction of type hints may help to improve this issue to some extent. Also, keeping type hints optional is a great idea, because while it helps with larger code bases, it can also be an annoyance for smaller coding projects.
Q: What are you most excited about in Python today?
A: I am super excited that I can do anything that I need in Python. I can spend my time efficiently on research and problem solving, without the need to spend most of my days learning new tools and programming languages.
Sure, sometimes it's good to look beyond the Python ecosystem, to see what's out there and what could potentially be useful. However, overall, I am super happy with the status quo of Python. I am excited about the continued development of the fundamental data science libraries like NumPy, which received a large grant from the Moore Foundation to focus on improving the library even further.
Also, I recently saw a conference talk on the redesign of pandas, pandas 2, which will make this already great library even more efficient, without changing the user interface.
The one thing I am probably most excited about, though, is the great community around Python. It's great to feel part of the Python community and to be in the same boat regarding advancing the landscape of tools and science. I can share knowledge, learn from others and share my excitement with like-minded people.
Q: What do you think about the long life of Python 2.7? Should people move over?
A: That's a good question. Personally, I always recommend using the latest version of Python. However, I also realize that this is not always possible for everyone.
If your project involves working on or with an older Python 2.7 code base, then it may not be feasible to make the switch in terms of resources. Regarding the long life of Python 2.7, we all know that Python 2.7 will not be officially maintained after 2020. One thing that might happen is that a subcommunity will take over the maintenance of Python 2.7.
I also wonder whether it would be worthwhile to spend the energy and resources maintaining Python 2.7 after 2020 as a side project, versus taking the time to port Python 2.7 code bases over to Python 3.x. The long-term maintenance of Python 2.7 will always remain uncertain.
Personally, I always install the latest version of Python when it comes out and do all of my coding in Python 3. However, most of my projects also support Python 2.7. The reason is that there are still many people using Python 2.7 who cannot switch, and I don't want to exclude anyone. So if it does not require any major hassle or clunky workarounds, then I write my code in a way that is compatible with both Python 2.7 and 3.x.
Q: What changes would you like to see in future Python releases?
A: My apologies, but my answer is a rather boring one: I am quite happy with Python's current set of features and don't have anything significant on my wish list.
One thing that I and multiple other people are sometimes complaining about is Python's Global Interpreter Lock (GIL). However, for my needs, it's typically not an issue. For instance, I like control over when to do multithreading or multiprocessing.
I wrote my little multiprocessing wrappers (in the mputil package) to evaluate Python generators lazily, which was an issue concerning memory consumption when I was working with vanilla Pool classes from Python's multiprocessing standard library. Besides, there are great libraries out there, like joblib, which make multiprocessing and threading super convenient.
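The general pattern behind such wrappers (this is not mputil's actual API, just a sketch of lazy, batched evaluation using only the standard library) looks something like this:

```python
from itertools import islice
from multiprocessing import Pool

def work(x):
    return x * x  # stand-in for an expensive task

def batches(iterable, size):
    # Pull items from a generator in fixed-size chunks so the whole
    # stream is never materialized in memory at once
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    stream = (i for i in range(1_000_000))  # lazy data source
    with Pool(processes=4) as pool:
        for chunk in batches(stream, size=10_000):
            results = pool.map(work, chunk)  # results consumed batch by batch
```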
On top of that, most libraries that I use for the heavy lifting when it comes to doing computations in parallel (Dask, TensorFlow, and PyTorch) already support multiprocessing and use Python more as a glue language as I mentioned earlier, so that computational efficiency is never really an issue.
Driscoll: Thank you, Sebastian Raschka.
Python Interviews by Michael Driscoll was published in 2018. You can buy the book here. The interviewee, Sebastian Raschka, has authored several books including the bestselling Python Machine Learning, 3rd Edition (2019) and Machine Learning with PyTorch and Scikit-Learn (2022).
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. If you have any feedback, leave your comments below.