PythonPro #45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published Polars Cookbook, which shows you how to convert DataFrames and Series between Polars and pandas.
News Highlights: Python Developer Survey: 55% use Linux, 6% still on Python 2; SuperTree enables interactive decision tree visuals in Jupyter; and OneBusAway launches Python and JavaScript SDKs for seamless data integration.
Here are my top 5 picks from our learning resources today:
And today’s Featured Study highlights how process mining, using tools like pm4py, can uncover insights into workflow efficiency, variability, and algorithmic performance.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: This month’s survey is now live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Python Developer Survey - 55% Use Linux, 6% Use Python 2: The 7th annual Python Developers Survey, which gathered responses from over 25,000 developers worldwide, also found that Visual Studio Code is the leading IDE.
supertree - Interactive Decision Tree Visualization: This Python package is designed to create interactive visualizations of decision trees within Jupyter Notebooks, Jupyter Lab, Google Colab, and similar environments that support HTML rendering.
OneBusAway Launches Official Python and JavaScript SDKs: Developed as part of the Google Summer of Code, these SDKs simplify the incorporation of OneBusAway's data, offer consistent API usage across platforms, and include comprehensive documentation.
💼Case Studies and Experiments🔬
Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks: Provides a step-by-step guide on accessing the API, retrieving data such as park entrance fees, and organizing it into a Pandas DataFrame for analysis.
Code Without Any Syntax: Discusses an experiment in which the author uses an LLM to convert natural language instructions into functional Python code without traditional syntax.
📊Analysis
Make magic with Mesop - python based web apps: Reviews Mesop, a newly released Python-based framework for building web apps. Read for tips to get started.
Why I Prefer Django for My Projects: While acknowledging the strengths of Node.js and Express.js, the author of this article finds Django's holistic, secure, and efficient approach better suited to their needs in web development.
🎓 Tutorials and Guides 🤓
Web Scraping With Scrapy and MongoDB: Guides you through setting up a Scrapy project, building a web scraper, extracting data, and storing it in MongoDB. Read to also learn about testing and debugging techniques.
Generate Images With DALL·E and the OpenAI API: Covers setting up the necessary environment, making API calls to create images from text prompts, handling image variations, and converting Base64 JSON responses to PNG files.
Primer on Jinja Templating: Covers installation, basic usage, and advanced features like loops, conditional statements, and macros. Read to learn how to integrate Jinja with Flask to build a basic web project with dynamic web pages.
How to Install Python on Your System - A Guide: Provides a comprehensive guide to installing Python on various systems, including Windows, macOS, Linux, iOS, and Android.
Adventures building a spreadsheet engine in Python: Demonstrates using the Lark Python package to parse formulas and compute dependencies, employing a topological sort algorithm to determine the order of cell evaluation.
How to write your first Genetic Algorithm — Knapsack Problem: Guides you through implementing a genetic algorithm using Python. Read to learn how to apply genetic algorithms to solve complex optimization problems.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Explained in 5 Minutes: Provides a concise explanation of the DBSCAN algorithm, which identifies clusters in data based on spatial distance and detects outliers without needing to predefine the number of clusters.
🔑Best Practices and Advice🔏
Escaping from Anaconda's Stranglehold on macOS: Provides simple, non-technical instructions to move the .zshrc file, allowing users to switch between Anaconda and official Python installations without terminal commands.
Why I Still Use Python Virtual Environments in Docker: Argues that virtual environments simplify the management of Python applications, particularly in production settings, by ensuring consistent and isolated environments across different stages of development.
Python Classes - The Power of Object-Oriented Programming: Covers defining classes, creating objects, managing attributes and methods, and the benefits of using classes. Read to learn about advanced topics like inheritance.
Python packaging is a MESS: Stress-tests nine Python package managers, including pip, conda, poetry, and newer tools like pixi and hatch, highlighting the historical issues and modern solutions in Python packaging.
Use python -m http.server in SSL: Provides a custom script, ssl_server.py, that wraps http.server to enable serving static sites over HTTPS using a self-signed SSL certificate. Read to learn how to serve static content securely.
🔍Featured Study: Navigating Process Mining - A Case Study using pm4py💥
In "Navigating Process Mining: A Case Study using pm4py," Kovács et al. explore the application of the pm4py library in analysing road traffic fine management processes. The study aims to demonstrate how process mining can uncover key insights into process efficiency and optimisation.
Context
Process mining is a technique that combines data mining and business process management to analyse event logs generated by information systems. It is particularly effective for uncovering hidden patterns, identifying bottlenecks, and optimising workflows. The study focuses on applying the pm4py library, an open-source Python tool, to a real-world road traffic fine management process. This approach offers a deeper understanding of process execution compared to traditional business intelligence tools.
Key Findings
The study's application of process mining to road traffic fine management revealed significant insights into process variability, algorithmic performance, and workflow complexity:
Process Variants: The analysis identified 231 distinct process variants, with one variant accounting for 56,482 cases (approximately 37.6% of the total 150,370 cases), indicating a dominant workflow path.
Algorithm Performance: Three process mining algorithms were evaluated:
Alpha Miner: Revealed causal dependencies between activities, achieving simplicity and precision scores of 0.66.
Inductive Miner: Employed a recursive approach to construct process models, scoring 0.62 in simplicity and 0.58 in precision.
Heuristic Miner: Utilised heuristics to infer process models from event data, achieving a perfect precision score of 1.0 but a lower simplicity score of 0.54.
Start and End Events: The process log analysis showed that 'Create Fine' was the most frequent start event, occurring 150,370 times. Multiple end events, such as 'Send Fine', 'Payment', and 'Send for Credit Collection,' were identified, indicating diverse process pathways.
Process Discovery and Visualisation: The discovered models allowed a detailed understanding of workflow structures and dependencies. Each mining approach had strengths and limitations in capturing the process dynamics, with pm4py proving effective in facilitating process mining tasks.
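To make the terms in the findings above concrete, here is a minimal sketch in plain Python of what "process variants", "start events", and "end events" mean. It does not use pm4py or the study's data: the toy event log below is invented for illustration and only borrows activity names from the study. The last lines simply redo the arithmetic behind the reported share of the dominant variant.

```python
from collections import Counter

# Hypothetical toy event log: (case_id, activity) pairs, already in time order.
# The activity names echo the study; the cases themselves are invented.
events = [
    ("c1", "Create Fine"), ("c1", "Send Fine"), ("c1", "Payment"),
    ("c2", "Create Fine"), ("c2", "Payment"),
    ("c3", "Create Fine"), ("c3", "Send Fine"), ("c3", "Payment"),
    ("c4", "Create Fine"), ("c4", "Send Fine"), ("c4", "Send for Credit Collection"),
]

# Group activities by case to get each case's trace (its ordered activity list).
traces = {}
for case_id, activity in events:
    traces.setdefault(case_id, []).append(activity)

# A variant is a distinct activity sequence; count how many cases follow each one.
variants = Counter(tuple(trace) for trace in traces.values())

# Start and end events are the first and last activity of each case.
start_events = Counter(trace[0] for trace in traces.values())
end_events = Counter(trace[-1] for trace in traces.values())

# The dominant variant and the fraction of cases it covers in the toy log.
top_variant, top_count = variants.most_common(1)[0]
top_share = top_count / len(traces)

# The same ratio for the figures reported in the study.
study_share = 56_482 / 150_370

print(f"{len(variants)} variants; top variant covers {top_share:.0%} of toy cases")
print(f"Study's dominant variant share: {study_share:.1%}")
```

In a real project, a library such as pm4py would compute these statistics directly from an event log, but the underlying idea is the same grouping and counting shown here.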
What This Means for You
This study is relevant to data scientists, business analysts, and operations managers interested in optimising business processes. The pm4py library, as demonstrated in this case study, provides practical tools for analysing complex workflows, identifying inefficiencies, and improving operational efficiency. The insights gained can be applied to other business processes, making it a valuable resource for those aiming to enhance process performance.
Examining the Details
The study used the pm4py library to analyse an event log related to the management of road traffic fines, covering activities such as creating fines, sending fines, adding penalties, managing appeals, and handling payments. The analysis involved three process mining algorithms—Alpha Miner, Inductive Miner, and Heuristic Miner—to discover process models from the event log data. The evaluation of simplicity and precision across these algorithms revealed that the Heuristic Miner achieved the highest precision score of 1.0, while the Alpha Miner provided a balance between simplicity and accuracy.
You can learn more by reading the entire paper and accessing the pm4py library.
🧠 Expert insight💥
Here’s an excerpt from “Chapter 10: Interoperability with Other Python Libraries” in the Polars Cookbook, by Yuki Kakegawa, published in August 2024.
Converting to and from a pandas DataFrame
Many of you have used pandas before, especially in your day-to-day work. Although pandas and Polars are often compared as one-or-the-other tools, you can use these tools to supplement each other.
Polars allows you to convert between pandas and Polars DataFrames, which is exactly what we’ll cover in this recipe.
Getting ready
You need pandas and pyarrow installed for this recipe to work. Run the following command to make sure that you have them installed:
pip install pandas pyarrow
How to do it...
Here’s how to convert to and from pandas DataFrames. We’ll first create a Polars DataFrame and then go through ways to convert back and forth between Polars and pandas:
Create a Polars DataFrame from a Python dictionary:
import polars as pl

df = pl.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
type(df)
The preceding code will return the following output:
>> polars.dataframe.frame.DataFrame
Convert a Polars DataFrame to a pandas DataFrame using the .to_pandas() method:
pandas_df = df.to_pandas()
type(pandas_df)
The preceding code will return the following output:
>> pandas.core.frame.DataFrame
Convert a pandas DataFrame to a Polars DataFrame using the .from_pandas() method:
df = pl.from_pandas(pandas_df)
type(df)
The preceding code will return the following output:
>> polars.dataframe.frame.DataFrame
If you want to allow zero-copy operations, then you need to enable the use_pyarrow_extension_array parameter:
df.to_pandas(use_pyarrow_extension_array=True).dtypes
The preceding code will return the following output:
>>
a    int64[pyarrow]
b    int64[pyarrow]
dtype: object
You can also create a Polars DataFrame by wrapping a pandas DataFrame using pl.DataFrame():
type(pl.DataFrame(pandas_df))
The preceding code will return the following output:
>> polars.dataframe.frame.DataFrame
How it works...
Polars has built-in methods to interoperate with pandas such as .from_pandas() and .to_pandas(). Each method is descriptive enough that you can see that .from_pandas() is used for reading data into Polars from pandas, whereas .to_pandas() is used to convert Polars objects into pandas.
The use_pyarrow_extension_array parameter of the .to_pandas() method uses PyArrow-supported arrays instead of NumPy arrays for the columns within the pandas DataFrame. This enables zero-copy operations and maintains the integrity of null values.
There’s more...
You can also convert between a pandas Series and a Polars Series:
s = pl.Series([1,2,3])
type(s.to_pandas())
The preceding code produces the following:
>> pandas.core.series.Series
The .from_pandas() method returns a Series object when a pandas Series is passed in:
type(pl.from_pandas(s.to_pandas()))
The preceding code produces the following:
>> polars.series.series.Series
Packt library subscribers can continue reading the entire book for free. You can buy the Polars Cookbook, by Yuki Kakegawa, here.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or leave a comment below!