PythonPro #67: PEP 751 Lock Files, Discord RAT, Prompt Toolkit 3.0, and Async Python at Duolingo
Welcome to a brand new issue of PythonPro!
News Highlights: Python adopts PEP 751 for standardized lock files; new Discord-based Python RAT steals credentials; Prompt Toolkit 3.0 adds rich CLI features; and OpenAI Agents SDK gains MCP support for external tool access.
My top 5 picks from today’s learning resources:
We hacked Google’s A.I Gemini and leaked its source code (at least some part)🕵️‍♂️
Share Python Scripts Like a Pro: uv and PEP 723 for Easy Deployment📦
And, in From the Cutting Edge, we introduce Freyja, a lightweight Python library for scalable data discovery in data lakes, enabling efficient join discovery and data augmentation by profiling attributes and predicting joinability without heavy infrastructure or deep learning models.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
New Python lock file format will specify dependencies: Python has officially adopted PEP 751, introducing a universal, human-readable lock file format to standardize dependency specification for reproducible installs.
New Python-Based Discord RAT Attacking Users to Steal Login Credentials: The RAT uses Discord as its command-and-control channel to steal browser credentials, execute system commands, and capture screenshots.
Python Prompt Toolkit 3.0: The latest version of this Python library for building advanced interactive command-line applications includes features like syntax highlighting, autocompletion, multiline editing, and full-screen UI support.
OpenAI adds Model Context Protocol (MCP) support to Agents SDK: MCP is a standard for connecting LLMs to external tools and data sources using local or remote servers.
Big improvements to checkpoint performance in latest LangGraph Python: The latest release of LangGraph for Python (langgraph 0.3.21) achieves up to 1.7x faster checkpoint performance in benchmarks.
💼Case Studies and Experiments🔬
We hacked Google’s A.I Gemini and leaked its source code (at least some part): Demonstrates advanced LLM red-teaming techniques, sandbox inspection, and secure code exploitation—useful for developers working with AI sandboxes, custom interpreters, or secure system integrations.
Smuggling Python Code Using Shrugging Faces: Demonstrates how Python code can be covertly embedded within a single emoji using zero-width joiner sequences, effectively smuggling a working REPL inside what appears to be a shrugging face.
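The article’s exact scheme builds on zero-width joiner sequences; a closely related trick, sketched below, hides arbitrary bytes after a visible emoji using Unicode variation selectors (invisible codepoints). All names here are illustrative, not the article’s implementation.

```python
# Hide arbitrary bytes "inside" a single visible emoji by appending
# Unicode variation selectors: VS1-VS16 (U+FE00-U+FE0F) encode byte
# values 0-15, VS17-VS256 (U+E0100-U+E01EF) encode values 16-255.
BASE = "\U0001F937"  # the shrugging-person emoji

def byte_to_vs(b: int) -> str:
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)

def encode(payload: bytes) -> str:
    # Visually this string renders as just the emoji.
    return BASE + "".join(byte_to_vs(b) for b in payload)

def decode(text: str) -> bytes:
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)
    return bytes(out)

secret = encode(b"print('hi')")
assert decode(secret) == b"print('hi')"
```

The smuggled bytes survive copy-paste through most text channels, which is exactly what makes the technique interesting from a security standpoint.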
📊Analysis
Shadowing in Python gave me an UnboundLocalError: A personal account of encountering and resolving a common Python error with context, reflection, and a small illustrative example.
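A minimal reproduction of the error class the article discusses: assigning to a name anywhere in a function makes it local for the whole function body, so an earlier read of the shadowed global fails.

```python
count = 0

def bump():
    print(count)   # fails here: 'count' is local because of the line below
    count = 1      # this assignment shadows the global for the whole function

try:
    bump()
except UnboundLocalError as exc:
    err = exc      # e.g. "local variable 'count' referenced before assignment"

assert isinstance(err, UnboundLocalError)
```

The usual fixes are renaming the local, or declaring `global count` when mutation of the module-level name is actually intended.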
Democratizing AI Compute, Part 7: What about Triton and Python eDSLs?: Examines Python-based embedded domain-specific languages (eDSLs) like Triton as a means to combine Python’s ease with GPU-level control for AI workloads.
🎓 Tutorials and Guides 🤓
📖Open Access Book | Architecture Patterns with Python: Introduces architectural patterns for building testable, maintainable Python applications using TDD, DDD, and event-driven principles.
Share Python Scripts Like a Pro: uv and PEP 723 for Easy Deployment: Shows you how to create and share self-contained Python scripts with embedded dependencies for hassle-free deployment across systems.
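A minimal sketch of what such a self-contained script looks like. The comment block follows PEP 723’s inline script metadata format; the file name and dependency shown are illustrative.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["requests"]
# ///
# Running `uv run fetch.py` parses the block above, provisions a throwaway
# environment with the listed dependencies, and executes the script in it.
import requests

resp = requests.get("https://peps.python.org/pep-0723/")
print(resp.status_code)
```

Because the metadata is ordinary comments, the file remains a valid plain Python script; only PEP 723-aware runners like uv give the block meaning.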
How to build Hot Module Replacement (HMR) in Python: Shows you how to HMR using a dependency map to reload only affected modules instead of restarting the entire process.
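The core idea can be sketched in a few lines: keep a reverse dependency map and, when a file changes, reload only the modules that (transitively) import it. This is a hedged toy version, not the article’s implementation; all names are illustrative.

```python
import importlib
import sys

# module name -> set of modules that import it (the reverse dependency map)
dependents: dict[str, set[str]] = {}

def record_import(imported: str, importer: str) -> None:
    dependents.setdefault(imported, set()).add(importer)

def affected(changed: str) -> list[str]:
    # Breadth-first walk of everything that transitively imports `changed`.
    seen, queue = [changed], [changed]
    while queue:
        for dep in dependents.get(queue.pop(0), ()):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen

def hot_reload(changed: str) -> None:
    # Reload only the affected modules instead of restarting the process.
    for name in affected(changed):
        if name in sys.modules:
            importlib.reload(sys.modules[name])

record_import("models", "views")
record_import("views", "app")
assert affected("models") == ["models", "views", "app"]
```

A real HMR system would also watch the filesystem and reload in dependency order; this sketch only shows the map-and-propagate step.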
Writing a 6502 emulator in Python: Explains how CPUs work by building an emulator, deepening your understanding of processor architecture, memory access, and instruction flow.
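The heart of any such emulator is a fetch-decode-execute loop. A minimal sketch with two real 6502 opcodes (LDA immediate and ADC immediate) and a toy halt; flags, carry, and addressing modes are omitted for brevity.

```python
class CPU6502:
    def __init__(self):
        self.mem = bytearray(65536)  # the 6502's 64 KiB address space
        self.pc = 0x0600             # program counter (arbitrary start address)
        self.a = 0                   # accumulator

    def step(self) -> bool:
        opcode = self.mem[self.pc]
        self.pc += 1
        if opcode == 0xA9:                        # LDA #imm: load immediate into A
            self.a = self.mem[self.pc]
            self.pc += 1
        elif opcode == 0x69:                      # ADC #imm (carry/flags ignored here)
            self.a = (self.a + self.mem[self.pc]) & 0xFF
            self.pc += 1
        elif opcode == 0x00:                      # BRK: treated as halt in this toy
            return False
        else:
            raise NotImplementedError(f"opcode {opcode:#04x}")
        return True

cpu = CPU6502()
cpu.mem[0x0600:0x0605] = bytes([0xA9, 0x02, 0x69, 0x03, 0x00])  # LDA #2; ADC #3; BRK
while cpu.step():
    pass
assert cpu.a == 5
```

A full emulator extends `step` to all 151 official opcodes plus status flags and cycle counting, but the loop structure stays exactly this shape.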
The Dark Side of Python’s pickle – How to Backdoor an AI Model: Walks you through how pickle works, how its vulnerabilities can be exploited, and how to mitigate them, all with detailed technical examples.
Specializing Python with E-graphs: Shows you how to build a Python expression compiler using e-graphs and MLIR. Demonstrates symbolic rewriting with Egglog, compiles optimized NumPy-style code to MLIR, lowers it to LLVM, and executes it via JIT.
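The pickle danger mentioned above comes down to one mechanism: pickle records an object’s `__reduce__` result, and whatever callable it names is invoked at load time. A benign payload stands in for malware in this sketch.

```python
import pickle

class Payload:
    def __reduce__(self):
        # A real attack would return something like (os.system, ("<shell cmd>",));
        # eval of harmless arithmetic stands in for the malicious call here.
        return (eval, ("40 + 2",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # the callable runs during unpickling
assert result == 42           # loading the blob executed attacker-chosen code
```

This is why untrusted pickles (including downloaded model files that embed them) should never be loaded directly; safer options include restricted unpicklers or formats like safetensors.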
How to use Hinge Loss and Squared Hinge Loss with Keras: Walks you through dataset generation, model architecture, training setup, and performance visualization on a non-linear dataset using TensorFlow 2.
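Independent of Keras, the two losses themselves are simple: with -1/+1 targets and raw prediction scores, squared hinge squares the per-sample margin penalty, punishing large violations more heavily. A small NumPy sketch matching the standard definitions:

```python
import numpy as np

def hinge(y_true, y_pred):
    # mean(max(0, 1 - y_true * y_pred)) over the batch
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def squared_hinge(y_true, y_pred):
    # same margin term, squared before averaging
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)

y = np.array([1.0, -1.0, 1.0])   # -1/+1 labels
p = np.array([0.5, -2.0, 1.5])   # raw scores
# per-sample margins: max(0, 0.5), max(0, -1.0), max(0, -0.5) -> only one penalty
assert abs(hinge(y, p) - 0.5 / 3) < 1e-9
```

In Keras these correspond to the `hinge` and `squared_hinge` loss names passed to `model.compile`.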
🔑Best Practices and Advice🔏
How we started our async python migration: Demonstrates how Duolingo migrated a Python microservice from synchronous to asynchronous execution, improving performance, reducing infrastructure costs, and achieving a 40% increase in request handling per instance.
Using overload to handle tagged union return types: Shows you how to improve static type checking in Python when functions return tagged unions based on input types, using typing.overload and Literal.
Operationalizing Python – Part 1: Why Does It Hurt?: Explains why Python codebases degrade in large organizations and how aligned tooling, not strict rules, can guide teams toward maintainability.
Claude 3.7 meta-reflects on Clojure vs Python: Explains how AI coding assistants perform better in structured, functional environments and how using AI in architectural discussions, TDD, and documentation can improve open-source project quality.
Python’s ‘shelve’ is really useful for LLM debugging: Shows how Python’s built-in shelve module can act as a persistent key-value store and help avoid redundant API calls by caching model responses as pickled Python objects for fast, low-cost reuse.
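The caching pattern is a few lines of stdlib code: key each call by its prompt and replay the stored response instead of paying for it again. The fake_llm stand-in below replaces a real API call; names are illustrative.

```python
import os
import shelve
import tempfile

calls = {"n": 0}

def fake_llm(prompt: str) -> str:
    # Stand-in for a slow, paid LLM API call.
    calls["n"] += 1
    return prompt.upper()

path = os.path.join(tempfile.mkdtemp(), "llm_cache")

def cached_completion(prompt: str) -> str:
    with shelve.open(path) as db:   # values are pickled/unpickled transparently
        if prompt not in db:
            db[prompt] = fake_llm(prompt)
        return db[prompt]

assert cached_completion("hello") == "HELLO"
assert cached_completion("hello") == "HELLO"
assert calls["n"] == 1              # second call was served from the shelf
```

Because the shelf persists across interpreter runs, a debugging session can be restarted without re-issuing any previously seen prompt.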
🔍From the Cutting Edge: Supporting Data Discovery Tasks at Scale with Freyja💥
In "Supporting Data Discovery Tasks at Scale with Freyja," Marc Maynou and Sergi Nadal introduce Freyja, a scalable data discovery system developed to support join discovery and data augmentation tasks within large and heterogeneous data lakes, released as a lightweight Python library.
Context
Data lakes are vast, schema-flexible repositories where different stakeholders contribute datasets of varying structure and semantics. In contrast to traditional data warehouses, which follow a model-first integration approach, data lakes adopt a load-first strategy, making data easier to ingest but harder to discover and relate.
Data discovery refers to the automatic identification of relevant datasets that can be combined for analysis. A core sub-task is join discovery, which aims to find attributes from different datasets that can be meaningfully joined. This is often used for data augmentation, where new features are added to training datasets to improve machine learning models.
Conventional approaches to join discovery are either too simplistic (e.g., relying on exact value overlaps) or too resource-intensive (e.g., involving deep learning or semantic embeddings). Freyja is introduced as a middle ground—semantically aware, but efficient and easy to deploy.
Key Features of Freyja
Attribute profiling: Generates compact representations of columns based on 62 profile features (e.g., entropy, value frequency, string lengths), using analytical databases like DuckDB for efficiency.
Joinability ranking: Predicts how well a column can be joined with others using a pretrained model that analyses profile similarities.
Data augmentation: Enables automated joining of datasets based on the predicted rankings to enrich data used in downstream models.
Scalability and portability: Profiles are small in size and independent, allowing the entire pipeline to run in-memory with linear scalability.
Ease of use: Integrated into notebooks with minimal setup, it is accessible to both novice and experienced data scientists.
What This Means for You
Freyja is highly relevant for data scientists and machine learning practitioners who work in organisations with extensive data lakes but limited infrastructure. It simplifies complex data discovery tasks without sacrificing accuracy or scalability. For public sector analysts, researchers, or commercial data teams looking to enrich datasets without extensive engineering overhead, Freyja provides a practical and portable solution.
Examining the Details
Freyja replaces computationally expensive set-overlap checks with a predictive model trained on a large corpus of attribute pairs with known joinability values. This model assesses the distance between attribute profiles—vectors of normalised feature values—and produces a continuous joinability score. Because profiles are standardised and comparisons are made via Z-score normalisation, the approach is robust to the heterogeneity typical of data lakes.
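The comparison step described above can be sketched in plain Python; this is an illustrative toy (two features, three attributes), not Freyja’s actual code. Each profile feature is Z-score normalised across the corpus, then candidate pairs are scored by the distance between their normalised profiles.

```python
import math

def zscore(values):
    # Normalise one profile feature across all attributes in the corpus.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]

def profile_distance(p1, p2):
    # Euclidean distance between two normalised profile vectors; Freyja
    # feeds such comparisons to its pretrained joinability model.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# Three attributes, two profile features each (e.g. entropy, avg string length).
raw = [[0.9, 5.0], [0.8, 5.2], [0.1, 40.0]]
features = [zscore([attr[i] for attr in raw]) for i in range(2)]
profiles = [[features[i][j] for i in range(2)] for j in range(3)]

# The two similar attributes end up closer than the dissimilar pair.
assert profile_distance(profiles[0], profiles[1]) < profile_distance(profiles[0], profiles[2])
```

Normalising per feature is what makes the distances comparable across heterogeneous columns, which is the robustness property the paper emphasises.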
The system’s architecture ensures that profiles are computed only once per attribute and reused across analyses. Freyja’s profiling avoids numeric columns due to their low join potential and instead targets categorical and textual attributes.
In demonstration, Freyja significantly improved model accuracy through data augmentation. For example, augmenting a rental price prediction dataset with just one additional attribute reduced the root mean squared error by nearly half—from 76.44 to 39.19.
You can learn more by reading the entire paper or accessing the library on GitHub.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just leave a comment below.