PythonPro #71: Pandas 3.0 Ditches NumPy, Pyrefly vs. ty, and HuggingFace for Object Detection
Welcome to a brand new issue of PythonPro!
News Highlights: Pandas 3.0 adopts PyArrow for faster string handling; Meta releases Pyrefly, a Rust-based type checker for large Python codebases; String Grouper gets 8× faster; and Muffin tops new ASGI benchmarks, beating FastAPI on JSON throughput.
My top 5 picks from today’s learning resources:
Pyrefly vs. ty: Comparing Python’s Two New Rust-Based Type Checkers⚙️
What's the Difference Between Zipping and Unzipping Your Jacket? • Unzipping in Python🧥
And, in From the Cutting Edge, we introduce dro, a Python library that makes state-of-the-art distributionally robust optimization techniques practical and scalable for machine learning by unifying 79 methods into a single modular framework compatible with scikit-learn and PyTorch.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Python Pandas Ditches NumPy for Speedier PyArrow: Pandas 3.0 introduces PyArrow as a required dependency and default for string data, marking a shift toward faster, columnar data processing—though full replacement of NumPy as the backend remains experimental.
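To see the change in one line, here is a minimal check, assuming a pandas 3.0 install; on 2.x the same column would report the NumPy-backed object dtype:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"]})

# Under pandas 3.0, string columns default to the PyArrow-backed
# string dtype ("str") instead of NumPy's generic "object" dtype.
print(df["city"].dtype)
```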
Meta Open-Sources Pyrefly, a High-Performance Python Type Checker in Rust: The type checker is designed to replace the OCaml-based Pyre and support responsive, scalable IDE typechecking—especially for large codebases like Instagram.
Even Faster String Matching in Python: The latest version of String Grouper, a Python library for fuzzy string matching using TF-IDF and cosine similarity, is now 8× faster than its original release.
Benchmarks for MicroPie v0.9.9.8: A benchmark comparing seven ASGI frameworks using a simple JSON "hello world" response showed that Muffin delivered the highest performance while FastAPI trailed with the lowest throughput.
MonsterUI: Bringing Beautiful UI to FastHTML: MonsterUI is a Python library that simplifies frontend development for FastHTML apps by providing pre-styled, responsive UI components with smart defaults.
💼Case Studies and Experiments🔬
Rhyme Analysis of Virgil’s Æneid in English translation — Part 2: Uses Python and CMUDict to detect rhyme patterns in Edward Fairfax Taylor’s English translation of Virgil’s Æneid, achieving over 92% accuracy in capturing the Spenserian stanza structure.
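For a flavour of the technique, here is a minimal CMUDict rhyme check using NLTK; this is an illustrative sketch, not the article's exact pipeline:

```python
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

PRON = cmudict.dict()  # maps a word to its phoneme sequences

def rhyme_part(word):
    """Return the phonemes from the last stressed vowel onward."""
    phones = PRON[word.lower()][0]  # first listed pronunciation
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":   # primary/secondary stress marker
            return tuple(phones[i:])
    return tuple(phones)

print(rhyme_part("night") == rhyme_part("delight"))  # True
```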
A Python frozenset interpretation of Dependent Type Theory: Illustrates how Python can serve as an intuitive metatheory for understanding complex type-theoretic concepts through executable, computable analogues.
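As a taste of the idea (a rough sketch only, not the article's exact encoding), a finite type can be modelled as a frozenset, and a dependent function type becomes the set of all choice functions over an indexed family:

```python
from itertools import product

Bool = frozenset({False, True})
# A type family indexed by Bool: Fam[False] has one element, Fam[True] two.
Fam = {False: frozenset({0}), True: frozenset({0, 1})}

def pi(index_type, family):
    """Dependent product: all functions f with f(i) in family[i],
    each represented as a frozenset of (input, output) pairs."""
    idx = sorted(index_type)
    choices = [[(i, v) for v in family[i]] for i in idx]
    return frozenset(frozenset(graph) for graph in product(*choices))

print(len(pi(Bool, Fam)))  # 1 * 2 = 2 functions in the product
```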
📊Analysis
Pyrefly vs. ty: Comparing Python’s Two New Rust-Based Type Checkers: Compares two emerging Rust-based Python type checkers—pyrefly (by Meta) and ty (by Astral)—based on speed, design goals, incrementalization strategies, and type inference behavior.
From Rows to Vectors: Under the Hood of DFEmbedder — A DataFrame Vector Store: Introduces DFEmbedder, an open source Python library that transforms tabular data into a low-latency vector store using static CPU-based embeddings.
🎓 Tutorials and Guides 🤓
Making C and Python Talk to Each Other: Covers locating and including Python.h, initializing and finalizing the Python interpreter, loading Python modules, calling Python functions (with and without arguments), and managing memory using PyObject references.
Building an MCP server as an API developer: Walks you through building and deploying a stateless MCP server using Python, FastAPI, and AWS services, illustrating how to integrate OAuth-secured Strava APIs and support Streamable HTTP transport for LLM-assisted applications.
Object Detection with Python and HuggingFace Transformers: Walks you through building an object detection pipeline while explaining how Transformer-based models like Detection Transformer (DETR) work and demonstrating a complete implementation.
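The transformers pipeline API keeps the core loop short. A minimal sketch, assuming transformers, torch, and Pillow are installed, with street_scene.jpg standing in for your own image:

```python
from transformers import pipeline

# DETR fine-tuned on COCO; the weights download on first run.
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

results = detector("street_scene.jpg")  # a path, URL, or PIL image
for r in results:
    print(f"{r['label']}: {r['score']:.2f} at {r['box']}")
```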
Expected Goals on Target (xGOT) 101: Explains a post-shot metric that improves on xG by factoring in shot placement, power, and trajectory—demonstrating how analysts use it to evaluate strikers’ finishing skill and goalkeepers’ shot-stopping, with a Python template.
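One common downstream use is goalkeeper "goals prevented", total xGOT faced minus goals conceded; a toy calculation with made-up shot values:

```python
# Post-shot xG (xGOT) for five on-target shots faced (made-up values).
xgot_faced = [0.08, 0.45, 0.12, 0.76, 0.30]
goals_conceded = 1

# Goals prevented: how many goals the keeper saved versus expectation.
goals_prevented = sum(xgot_faced) - goals_conceded
print(round(goals_prevented, 2))  # 0.71
```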
Regression Trees Explained: The Most Intuitive Introduction: Offers a step-by-step explanation and Python implementation of regression trees, illustrating how they partition feature space and make predictions through recursive variance minimization.
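The core mechanic, picking the split that most reduces variance, fits in a few lines. A minimal NumPy sketch of a single split search (not the article's full implementation):

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on 1-D feature x that minimises the
    weighted variance of y across the two resulting partitions."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:          # skip max so both sides stay non-empty
        left, right = y[x <= t], y[x > t]
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.2, 1.0, 5.0) + rng.normal(0, 0.1, 200)
print(best_split(x, y))  # lands just below the true breakpoint at 4.2
```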
Efficiently dissolving adjacent polygons by attributes in a large GIS database: Demonstrates a step-by-step method with SQL and Python to cluster, merge, and reduce over 750,000 land-use records into fewer, generalized geometries.
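In GeoPandas the merging step itself is nearly a one-liner; a hedged sketch, with landuse.gpkg and the landuse column as illustrative stand-ins for your own data:

```python
import geopandas as gpd

gdf = gpd.read_file("landuse.gpkg")  # illustrative source file

# Union all geometries that share a land-use class; adjacent polygons
# in the same class merge into a single (multi)polygon.
dissolved = gdf.dissolve(by="landuse", as_index=False)
print(len(gdf), "->", len(dissolved), "geometries")
```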
Tracking Urban Expansion Through Satellite Imagery: Covers selecting satellite imagery, preparing training data, computing indices, running classification, interpreting outputs, and validating results.
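The "computing indices" step is usually band arithmetic such as NDVI, which is just (NIR - Red) / (NIR + Red); a minimal NumPy sketch with stand-in reflectance values:

```python
import numpy as np

red = np.array([[0.10, 0.30], [0.25, 0.05]])  # stand-in red-band reflectance
nir = np.array([[0.60, 0.35], [0.50, 0.40]])  # stand-in near-infrared band

# NDVI ranges from -1 to 1; values near 1 indicate dense vegetation.
ndvi = (nir - red) / (nir + red)
print(ndvi.round(2))
```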
🔑Best Practices and Advice🔏
Matplotlib Alternatives That Actually Save You Time: Compares five modern Python visualization libraries—Plotly, Seaborn, Vega-Altair, Bokeh, and Plotnine—as more efficient, interactive, and expressive alternatives to Matplotlib.
Automate Your Life: Five Everyday Tasks Made Easy With Python: Showcases five simple, real-world Python scripts—generating QR codes, converting text to speech, translating text, taking screenshots, and censoring profanity.
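To pick one example from the list, generating a QR code takes two lines with the qrcode package (pip install "qrcode[pil]"):

```python
import qrcode

img = qrcode.make("https://www.python.org")  # any text or URL
img.save("python_qr.png")
```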
Serving Deep Learning in AdTech: Offers practical guidance on choosing a model-serving approach based on system constraints, latency, and deployment needs.
What's the Difference Between Zipping and Unzipping Your Jacket? • Unzipping in Python: Explains how Python’s `zip()` function not only combines multiple iterables into grouped tuples but can also be used in reverse—with unpacking—to "unzip" them back into separate iterables.
The Chores Rota (#3 in The `itertools` Series • `cycle()` and Combining Tools): Uses a fictional story to teach Python's `itertools.cycle()` and `zip()` functions, illustrating how to create synchronized infinite iterators for task rotation.
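Both tricks fit in a few lines (the names in the rota are invented for illustration):

```python
from itertools import cycle

pairs = list(zip(["a", "b", "c"], [1, 2, 3]))  # zip into grouped tuples
letters, numbers = zip(*pairs)                 # "unzip" via unpacking
print(letters, numbers)                        # ('a', 'b', 'c') (1, 2, 3)

# A chores rota: zip a finite week against an endless rotation of names;
# zip() stops at the shortest input, so the infinite cycle is safe.
rota = list(zip(["Mon", "Tue", "Wed", "Thu"], cycle(["Ava", "Ben"])))
print(rota)  # [('Mon', 'Ava'), ('Tue', 'Ben'), ('Wed', 'Ava'), ('Thu', 'Ben')]
```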
🔍From the Cutting Edge: DRO for ML💥
In "DRO: A Python Library for Distributionally Robust Optimization in Machine Learning," Liu et al. introduce dro, a Python library that brings together state-of-the-art distributionally robust optimization (DRO) techniques into a single, modular, and scalable software package for supervised learning tasks.
Context
DRO is a technique used in machine learning to build models that remain reliable under uncertainty—especially when there's a mismatch between training and deployment data distributions. This is crucial in high-stakes domains like healthcare, finance, and supply chain systems. DRO typically addresses this challenge by considering a worst-case loss over an ambiguity set: a collection of distributions close to the empirical training data under some metric.
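Formally, writing ℓ for the loss, f_θ for the model, and B_ε(P̂ₙ) for the ambiguity set of distributions within distance ε of the empirical distribution P̂ₙ, the DRO objective is:

```latex
\min_{\theta}\; \sup_{Q \in \mathcal{B}_{\varepsilon}(\widehat{P}_n)}\; \mathbb{E}_{(x,y)\sim Q}\bigl[\ell(f_\theta(x),\, y)\bigr]
```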
However, despite its theoretical promise, DRO has seen limited practical adoption due to the computational complexity of solving min-max problems and the lack of general-purpose libraries. Existing tools often either focus on a narrow subset of formulations or require users to manually reformulate and solve optimisation problems using external solvers.
The dro library directly addresses these gaps. It offers the first comprehensive, ML-ready implementation of diverse DRO formulations within a unified, modular Python package. Compatible with both scikit-learn and PyTorch, dro abstracts away the need for manual optimisation reformulations and enables scalable training, evaluation, and experimentation with robust models. This makes cutting-edge DRO techniques accessible to both practitioners and researchers, and usable in real-world workflows.
Key Features of dro
Comprehensive coverage: The library supports 79 DRO method combinations across 14 formulations and 9 model backbones, covering linear, kernel-based, tree-based, and neural models.
Seamless integration: All components follow the scikit-learn estimator interface and are compatible with PyTorch, enabling easy integration into existing machine learning workflows (see the sketch after this list).
Significant speed improvements: The library applies vectorisation, kernel approximation, and constraint reduction techniques to achieve 10× to 1000× speedups over baseline implementations.
Flexible customisation: Users can personalise loss functions, model architectures, and robustness parameters through a modular design that supports both exact and approximate optimisation.
Built-in diagnostics: The package includes tools to generate worst-case distributions and evaluate out-of-sample performance, supporting principled model assessment under distribution shift.
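Because everything follows the estimator interface, training a robust model should feel like any other scikit-learn workflow. The sketch below is a hypothetical illustration: the class name, import path, and eps parameter are assumptions rather than dro's documented API, so check the paper and repository for the real names.

```python
import numpy as np
# Hypothetical sketch: dro.WassersteinDRO and eps are illustrative
# assumptions, not the library's documented API.
from dro import WassersteinDRO

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X @ np.ones(5) > 0, 1, -1)

model = WassersteinDRO(eps=0.1)  # assumed ambiguity-set radius
model.fit(X[:150], y[:150])      # familiar fit/predict/score calls
print(model.score(X[150:], y[150:]))
```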
What This Means for You
The dro library is especially relevant for machine learning researchers, applied data scientists, and engineers working in high-stakes or shift-prone domains such as healthcare, finance, and logistics. It offers a practical pathway to integrate distributional robustness into real-world pipelines without requiring manual optimisation reformulations or deep expertise in convex programming. By unifying a wide range of DRO methods within a standardised, high-performance framework, dro enables users to develop models that remain reliable under uncertainty, experiment with robustness techniques at scale, and bridge the gap between theoretical advances and practical deployment.
Examining the Details
The dro library operationalises Distributionally Robust Optimization by solving min–max problems where the outer minimisation spans a model class and the inner maximisation ranges over an ambiguity set of plausible distributions. This ambiguity set is defined using distance metrics such as Wasserstein distances, f-divergences (KL, χ², Total Variation, CVaR), kernel-based distances like Maximum Mean Discrepancy (MMD), and hybrid measures including Sinkhorn and Moment Optimal Transport distances.
Exact optimisation is handled through disciplined convex programming using CVXPY, applicable to linear and kernel-based models with standard losses such as hinge, logistic, ℓ₁, and ℓ₂. For more complex architectures like neural networks and tree ensembles, the library employs approximate optimisation strategies using PyTorch, LightGBM, and XGBoost.
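To give a flavour of what exact optimisation looks like, here is a minimal CVXPY sketch of one classic result from the Wasserstein DRO literature: under standard assumptions, robust hinge-loss classification reduces to the empirical hinge loss plus a norm penalty scaled by the ambiguity radius. This illustrates the kind of reformulation dro automates; it is not code from the library itself.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.ones(5) + 0.3 * rng.normal(size=100))

theta = cp.Variable(5)
eps = 0.1  # Wasserstein ambiguity radius

# Robust objective: empirical hinge loss + eps * norm penalty.
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ theta))) / len(y)
cp.Problem(cp.Minimize(hinge + eps * cp.norm(theta, 2))).solve()
print(theta.value.round(3))
```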
To enhance scalability, the authors implement performance-optimisation techniques such as constraint vectorisation, Nyström kernel approximation, and constraint subsampling or sparsification, significantly reducing computational overhead without sacrificing accuracy. The methodology is underpinned by modular abstractions that isolate model type, loss function, and robustness metric, making the framework both extensible and maintainable.
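To make one of those techniques concrete: Nyström approximation swaps the full n × n kernel matrix for a low-rank feature map built from a handful of landmark points. Here it is with scikit-learn's implementation rather than dro's internals:

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

X = np.random.default_rng(0).normal(size=(1000, 10))

# Approximate an RBF kernel with 100 landmarks: the implicit
# 1000x1000 kernel matrix becomes an explicit 1000x100 feature matrix.
features = Nystroem(kernel="rbf", n_components=100, random_state=0).fit_transform(X)
print(features.shape)  # (1000, 100)
```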
Additional tooling supports synthetic and real-world dataset generation, worst-case distribution derivation, and corrected out-of-sample evaluation.
You can learn more by reading the entire paper here and accessing the library on GitHub.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just leave a comment below.