PythonPro #45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy

Sep 05, 2024

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight, we bring you an excerpt from the recently published Polars Cookbook, which shows you how to convert DataFrames and Series between Polars and pandas.

News Highlights: Python Developer Survey: 55% use Linux, 6% still on Python 2; SuperTree enables interactive decision tree visuals in Jupyter; and OneBusAway launches Python and JavaScript SDKs for seamless data integration.

Here are my top 5 picks from our learning resources today:

Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks🌲
Web Scraping With Scrapy and MongoDB🕸️
DBSCAN, Explained in 5 Minutes🧩
Python packaging is a MESS📦
Why I Still Use Python Virtual Environments in Docker🛳️

And today’s Featured Study highlights how process mining, using tools like pm4py, can uncover insights into workflow efficiency, variability, and algorithmic performance.

Stay awesome!

Divya Anne Selvaraj

Editor-in-Chief

P.S.: This month’s survey is now live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.


🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Python Developer Survey - 55% Use Linux, 6% Use Python 2: The 7th annual Python Developers Survey, which gathered responses from over 25,000 developers worldwide, also found that Visual Studio Code is the leading IDE.
supertree - Interactive Decision Tree Visualization: This Python package is designed to create interactive visualizations of decision trees within Jupyter Notebooks, Jupyter Lab, Google Colab, and similar environments that support HTML rendering.
OneBusAway Launches Official Python and JavaScript SDKs: Developed as part of the Google Summer of Code, these SDKs simplify the incorporation of OneBusAway's data, offer consistent API usage across platforms, and include comprehensive documentation.

💼Case Studies and Experiments🔬

Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks: Provides a step-by-step guide to accessing the API, retrieving data such as park entrance fees, and organizing it into a pandas DataFrame for analysis.
Code Without Any Syntax: Discusses an experiment in which the author uses an LLM to convert natural language instructions into functional Python code without traditional syntax.

📊Analysis

Make magic with Mesop - python based web apps: Reviews Mesop, a newly released Python-based framework for building web apps. Read for tips to get started.
Why I Prefer Django for My Projects: While acknowledging the strengths of Node.js and Express.js, the author of this article finds Django's holistic, secure, and efficient approach better suited to their needs in web development.

🎓 Tutorials and Guides 🤓

Web Scraping With Scrapy and MongoDB: Guides you through setting up a Scrapy project, building a web scraper, extracting data, and storing it in MongoDB. Read to also learn about testing and debugging techniques.
Generate Images With DALL·E and the OpenAI API: Covers setting up the necessary environment, making API calls to create images from text prompts, handling image variations, and converting Base64 JSON responses to PNG files.
Primer on Jinja Templating: Covers installation, basic usage, and advanced features like loops, conditional statements, and macros. Read to learn how to integrate Jinja with Flask to build a basic web project with dynamic web pages.
How to Install Python on Your System - A Guide: Provides a comprehensive guide to installing Python on various systems, including Windows, macOS, Linux, iOS, and Android.
Adventures building a spreadsheet engine in Python: Demonstrates using the Lark Python package to parse formulas and compute dependencies, employing a topological sort algorithm to determine the order of cell evaluation.
How to write your first Genetic Algorithm — Knapsack Problem: Guides you through implementing a genetic algorithm using Python. Read to learn how to apply genetic algorithms to solve complex optimization problems.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Explained in 5 Minutes: Provides a concise explanation of the DBSCAN algorithm, which identifies clusters in data based on spatial distance and detects outliers without needing to predefine the number of clusters.

🔑Best Practices and Advice🔏

Escaping from Anaconda's Stranglehold on macOS: Provides simple, non-technical instructions to move the .zshrc file, allowing users to switch between Anaconda and official Python installations without terminal commands.
Why I Still Use Python Virtual Environments in Docker: Argues that virtual environments simplify the management of Python applications, particularly in production settings, by ensuring consistent and isolated environments across different stages of development.
Python Classes - The Power of Object-Oriented Programming: Covers defining classes, creating objects, managing attributes and methods, and the benefits of using classes. Read to learn about advanced topics like inheritance.
Python packaging is a MESS: Stress-tests nine Python package managers, including pip, conda, poetry, and newer tools like pixi and hatch, highlighting the historical issues and modern solutions in Python packaging.
Use python -m http.server in SSL: Provides a custom script, ssl_server.py, that wraps http.server to enable serving static sites over HTTPS using a self-signed SSL certificate. Read to learn how to serve static content securely.

🔍Featured Study: Navigating Process Mining - A Case Study using pm4py💥

In "Navigating Process Mining: A Case Study using pm4py," Kovács et al. explore the application of the pm4py library in analysing road traffic fine management processes. The study aims to demonstrate how process mining can uncover key insights into process efficiency and optimisation.

Context

Process mining is a technique that combines data mining and business process management to analyse event logs generated by information systems. It is particularly effective for uncovering hidden patterns, identifying bottlenecks, and optimising workflows. The study focuses on applying the pm4py library, an open-source Python tool, to a real-world road traffic fine management process. This approach offers a deeper understanding of process execution compared to traditional business intelligence tools.

Key Findings

The study's application of process mining to road traffic fine management revealed significant insights into process variability, algorithmic performance, and workflow complexity:

Process Variants: The analysis identified 231 distinct process variants, with one variant accounting for 56,482 cases (approximately 37.6% of the total 150,370 cases), indicating a dominant workflow path.
Algorithm Performance: Three process mining algorithms were evaluated:
- Alpha Miner: Revealed causal dependencies between activities, scoring 0.66 on both simplicity and precision.
- Inductive Miner: Employed a recursive approach to construct process models, scoring 0.62 in simplicity and 0.58 in precision.
- Heuristic Miner: Utilised heuristics to infer process models from event data, achieving a perfect precision score of 1.0 but a lower simplicity score of 0.54.
Start and End Events: The process log analysis showed that 'Create Fine' was the most frequent start event, occurring 150,370 times. Multiple end events, such as 'Send Fine', 'Payment', and 'Send for Credit Collection,' were identified, indicating diverse process pathways.
Process Discovery and Visualisation: The discovered models allowed a detailed understanding of workflow structures and dependencies. Each mining approach had strengths and limitations in capturing the process dynamics, with pm4py proving effective in facilitating process mining tasks.
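To make the variant and start/end event counts above more concrete, here is a minimal pure-Python sketch of how such figures can be derived from an event log. It does not use pm4py and is not the study's code: the toy `EVENTS` log and the `summarise_log` helper are invented for illustration, and only the activity names echo those mentioned in the paper.

```python
from collections import Counter, defaultdict

# Toy event log as (case_id, activity, timestamp) rows.
# These values are invented for illustration, not taken from the study.
EVENTS = [
    ("c1", "Create Fine", 1), ("c1", "Send Fine", 2), ("c1", "Payment", 3),
    ("c2", "Create Fine", 1), ("c2", "Payment", 2),
    ("c3", "Create Fine", 1), ("c3", "Send Fine", 2), ("c3", "Payment", 3),
    ("c4", "Create Fine", 1), ("c4", "Send Fine", 2),
    ("c4", "Send for Credit Collection", 3),
]


def summarise_log(events):
    """Return variant counts plus start- and end-activity counts."""
    traces = defaultdict(list)
    # Order events by case and timestamp so each trace is in execution order.
    for case_id, activity, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    # A variant is one distinct sequence of activities shared by one or more cases.
    variants = Counter(tuple(trace) for trace in traces.values())
    starts = Counter(trace[0] for trace in traces.values())
    ends = Counter(trace[-1] for trace in traces.values())
    return variants, starts, ends


variants, starts, ends = summarise_log(EVENTS)
top_variant, top_count = variants.most_common(1)[0]
share = top_count / sum(variants.values())
print(f"{len(variants)} variants; most common covers {share:.1%} of cases")
```

The same share calculation applied to the study's figures gives 56,482 / 150,370, or roughly 37.6% of cases for the dominant variant.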

What This Means for You

This study is relevant to data scientists, business analysts, and operations managers interested in optimising business processes. The pm4py library, as demonstrated in this case study, provides practical tools for analysing complex workflows, identifying inefficiencies, and improving operational efficiency. The insights gained can be applied to other business processes, making it a valuable resource for those aiming to enhance process performance.

Examining the Details

The study used the pm4py library to analyse an event log related to the management of road traffic fines, covering activities such as creating fines, sending fines, adding penalties, managing appeals, and handling payments. The analysis involved three process mining algorithms—Alpha Miner, Inductive Miner, and Heuristic Miner—to discover process models from the event log data. The evaluation of simplicity and precision across these algorithms revealed that the Heuristic Miner achieved the highest precision score of 1.0, while the Alpha Miner provided a balance between simplicity and accuracy.

You can learn more by reading the entire paper and accessing the pm4py library.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 10: Interoperability with Other Python Libraries” in the Polars Cookbook, by Yuki Kakegawa, published in August 2024.

Converting to and from a pandas DataFrame

Many of you have used pandas before, especially in your day-to-day work. Although pandas and Polars are often compared as one-or-the-other tools, you can use these tools to supplement each other.

📚Related Titles from Packt

15 Math Concepts Every Data Scientist Should Know

Understand key data science algorithms with Python-based examples
Increase the impact of your data science solutions by learning how to apply existing algorithms
Take your data science solutions to the next level by learning how to create new algorithms

Get the eBook for ~~$35.99~~ $24.99!

Bayesian Analysis with Python

Conduct Bayesian data analysis with step-by-step guidance
Gain insight into a modern, practical, and computational approach to Bayesian statistical modeling
Enhance your learning with best practices through sample problems and practice exercises

Get the eBook for ~~$55.99~~ $38.99!

Polars allows you to convert between pandas and Polars DataFrames, which is exactly what we’ll cover in this recipe.

Getting ready

You need pandas and pyarrow installed for this recipe to work. Run the following command to make sure that you have them installed:

pip install pandas pyarrow

How to do it...

Here’s how to convert to and from pandas DataFrames. We’ll first create a Polars DataFrame and then go through ways to convert back and forth between Polars and pandas:

Create a Polars DataFrame from a Python dictionary:

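import polars as pl

# Build a small two-column DataFrame from a dictionary
df = pl.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})
type(df)  # confirms this is a Polars DataFrame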