PythonPro #38: Streamline Migrations with PyMigTax, PyTorch's tensors vs. NumPy's ndarrays, and PythonMonkey Bridges Codebases

Bite-sized actionable content, practical tutorials, and resources for Python programmers and data scientists.

Jul 17, 2024

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Modern Computer Vision with PyTorch - Second Edition, which highlights the advantages of PyTorch's tensors over NumPy's ndarrays, especially in performance when utilizing a GPU.

News Highlights: Ruff dominates Python linting with Rust, gaining massive weekly downloads; PythonMonkey merges Python and JavaScript for streamlined codebases; and CRuby outperforms CPython, reducing syscalls for better efficiency.

Here are my top 5 picks from our learning resources today:

Querying 1TB on a laptop with Python dataframes💻
Split Your Dataset With scikit-learn's train_test_split()🔪
Instrumenting Python GIL with eBPF🔧
Bloat beneath Python’s Scales - A Fine-Grained Inter-Project Dependency Analysis⚖️
Enterprise Python🏢

And, in today’s Featured Study, we introduce a new taxonomy aimed at simplifying Python library migrations.

Stay awesome!

Divya Anne Selvaraj

Editor-in-Chief

P.S.: This month’s survey is still on. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

The Python Linter Ruff Is a Win for Open Source — and Rust: Since its inception in 2022, Ruff has rapidly gained popularity, now achieving millions of downloads per week and supporting hundreds of lint rules. Read to learn about Ruff’s role in consolidating Python tooling.
Python Meets JavaScript, Wasm With the Magic of PythonMonkey: This tool reduces code maintenance by allowing a single codebase for NodeJS and Python projects. Read to learn about the capabilities and advantages of PythonMonkey.
CRuby writes files with 40% fewer syscalls than CPython?: Introduces a new version of Cirron which measures a piece of Python or Ruby code and reports back several performance counters. Read for insights to drive your language choice.

💼Case Studies and Experiments🔬

Binary secret scanning helped us prevent (what might have been) the worst supply chain attack you can imagine: Details a critical incident where the JFrog Security Research team discovered a leaked GitHub access token. Read to learn how proactive security measures can avert significant threats.
Querying 1TB on a laptop with Python dataframes: Details the process and results of benchmarking Python dataframe libraries like DuckDB, DataFusion, and Polars on a MacBook Pro with 96GB of RAM, using a 1TB TPC-H dataset. Read for insights into handling extensive data on relatively standard hardware.

📊Analysis

Embedded Python - MicroPython is amazing: Celebrates MicroPython's 11th anniversary, detailing its versatility and impact in embedded systems. Read to learn about the practical applications and significant advantages of using MicroPython.
Python has too many package managers: Critiques the Python package management ecosystem. Read to learn about the comparative advantages and drawbacks of tools like pip, poetry, conda, and Rust-based solutions like uv and pixi.

🎓 Tutorials and Guides 🤓

How Do You Choose Python Function Names?: Emphasizes the importance of selecting descriptive Python function names using snake_case. Read to learn why adherence to these guidelines minimizes bugs and improves code quality.
Split Your Dataset With scikit-learn's train_test_split(): Explores the critical role of data splitting in supervised machine learning. Read to learn how to divide datasets into training and testing subsets to minimize evaluation bias.
Instrumenting Python GIL with eBPF: In this article, eBPF is utilized to attach to the take_gil function in Python's code, recording the time taken for threads to acquire the GIL. Read to learn how to assess the GIL’s effects on app performance.
Robust Exception Handling in Python - An Expert Guide: Covers fundamentals, best practices, real-world tips, and a technical deep dive into Python's exception mechanisms. Read to learn how to gracefully manage and recover from errors.
Resource management and generators in Python: Focuses on the cleanup challenges associated with both synchronous and asynchronous generators. Read to learn how explicit management can mitigate delayed cleanup.
A Guide to Python's Weak References Using weakref Module: Introduces Python's weakref module, explaining its role in managing memory. Read to learn how and when to use weak references effectively in your code.
Making Python Less Random: Details experiences with Python's os.urandom and random.randint, and the complexity introduced by random elements from a third-party library. Read to learn advanced methods to control randomness.

🔑Best Practices, Advice, and Code Optimization🔏

Enterprise Python: Explores the evolution of enterprise software and Python's rising role in it, as delivered in a talk at EuroPython 2024. Read to learn about the significant changes in enterprise software development.
Proxy Objects in Python: Proxy objects are utilized in various programming scenarios like database connection management and filesystem abstraction. Read to learn how they simplify complex interactions and enhance code modularity.
Bloat beneath Python’s Scales - A Fine-Grained Inter-Project Dependency Analysis: Investigates dependency bloat within the PyPI ecosystem and finds that over 50% of dependencies are bloated, with 15% of defects located in these bloated areas. Read to learn how bloating occurs and more.
Time travel with Python: Discusses testing time-based functionality in Python, detailing methods like using unittest.mock and the Freezegun library. Read to learn how to ensure your time-dependent code performs as expected.
Python task queue latency: Highlights how design choices like polling versus blocking and the extent of acknowledgments affect latency. Read to learn about the trade-offs between ease of configuration and latency in Python task queues.

🔍Featured Study: Decoding Python Library Migrations with PyMigTax💥

In their comprehensive study, "Characterizing Python Library Migrations," Islam et al. delve into the intricacies of library migration within Python projects. This research, conducted across universities in Canada and the USA, introduces a novel taxonomy, PyMigTax, aimed at simplifying these migrations.

Context

Python library migration, a critical aspect of software development, involves replacing one software library with another. Often necessitated by the need for improved functionality, better performance, or enhanced security, these migrations can significantly impact the maintenance and upgradeability of applications. Given Python's widespread use and its dynamic ecosystem of libraries, understanding these migrations is vital for software sustainability.

Key Points

Extensive Empirical Analysis: The study meticulously analyses 3,096 migration-related code changes across 335 migrations from 311 client repositories, offering a solid empirical foundation.
Introduction of PyMigTax: PyMigTax is a newly developed taxonomy that categorizes types of code changes during library migrations, providing a structured approach to understanding migration challenges.
Detailed Insights into Migration Dynamics: The research highlights that:
- 40% of library pairs involve migrations that require modifications beyond function-based program elements.
- On average, a developer needs to learn about 4 new APIs and modify 8 lines of code per migration.
- Existing automated tools often fail to support many of the more complex migration scenarios encountered.
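To make the distinction between function-based and other program elements concrete, here is a hypothetical migration from the standard library's os.path to pathlib. It is not taken from the paper or from PyMigBench; the function names config_name_before and config_name_after are illustrative only. The point is that the migrated code replaces function calls with an operator and an attribute, which is the kind of change a purely function-to-function mapping would miss.

```python
import os.path
from pathlib import Path


def config_name_before(base_dir: str, filename: str) -> tuple[str, str]:
    """Source library (os.path): every step is a function call."""
    full = os.path.join(base_dir, filename)
    stem, _ext = os.path.splitext(os.path.basename(full))
    return full, stem


def config_name_after(base_dir: str, filename: str) -> tuple[str, str]:
    """Target library (pathlib): the same steps use an operator and an attribute."""
    full = Path(base_dir) / filename   # function call becomes the "/" operator
    return str(full), full.stem        # two function calls become one attribute


# Both versions should agree, which is the basic check after any migration.
before = config_name_before("conf", "app.toml")
after = config_name_after("conf", "app.toml")
print(before == after)
```

Even in this small case, the import, the call sites, and the return handling all change, so counting only replaced function calls would understate the work involved.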

What This Means for You

Enhanced Migration Strategy: By utilizing the PyMigTax taxonomy, you can better anticipate the breadth of changes involved and plan your migrations more effectively.
Optimization of Resources: Understanding the typical workload and challenges of migrations enables better planning and resource allocation.
Access to Resources: The study makes all developed tools, data, and methodologies publicly available on Figshare, equipping you with essential resources to manage migrations more adeptly.

Examining the Details

The study employed an empirical methodology, analyzing real-world Python library migrations to develop a comprehensive understanding of the migration process. Data was sourced from:

PyMigBench-2.0, a dataset containing verified migrations.
Manually labeled migration-related code changes from a wide array of domains and library pairs.

You can learn more by reading the entire paper.

🧠 Expert insight 📚

Here’s an excerpt from “Chapter 2: PyTorch Fundamentals” in the book, Modern Computer Vision with PyTorch - Second Edition, by V Kishore Ayyadevara and Yeshwanth Reddy, published in June 2024.

Advantages of PyTorch’s tensors over NumPy’s ndarrays

…when calculating the optimal weight values, we vary each weight by a small amount and understand its impact on reducing the overall loss value. Note that the loss calculation based on the weight update of one weight does not impact the loss calculation of the weight update of other weights in the same iteration. Thus, this process can be optimized if each weight update is made by a different core in parallel instead of updating weights sequentially. A GPU comes in handy in this scenario, as it consists of thousands of cores when compared to a CPU (which, in general, could have <=64 cores).
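The independence described above can be illustrated with a small standard-library sketch. It is not code from the book: the toy data, the linear model, and the helper names loss and loss_change are assumptions made for illustration. Each weight is nudged on its own copy of the weight list, so the per-weight results are the same whether they are computed one after another or by separate workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data for a linear model y = w0*x0 + w1*x1 (hypothetical values).
INPUTS = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
TARGETS = [5.0, 4.0, 9.0]


def loss(weights):
    """Mean squared error of the linear model on the toy data."""
    total = 0.0
    for features, target in zip(INPUTS, TARGETS):
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(TARGETS)


def loss_change(weights, index, delta=1e-4):
    """Change in loss when only weights[index] is nudged by delta."""
    nudged = list(weights)  # private copy, so no worker sees another's nudge
    nudged[index] += delta
    return loss(nudged) - loss(weights)


weights = [0.5, 0.5]

# Sequential: one weight after another.
sequential = [loss_change(weights, i) for i in range(len(weights))]

# Parallel: each weight is handled by its own worker.
with ThreadPoolExecutor() as pool:
    parallel = list(pool.map(lambda i: loss_change(weights, i), range(len(weights))))

print(sequential == parallel)
```

Threads in CPython will not speed up pure-Python arithmetic like this; the sketch only shows that the work can be split per weight without changing the result, which is the property a GPU's many cores exploit.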

Unlike a NumPy array, a Torch tensor object is optimized to work with a GPU. To understand this further, let’s perform a small experiment, where we perform the operation of matrix multiplication using NumPy arrays in one scenario and tensor objects in another, comparing the time taken to perform matrix multiplication in both scenarios:

Note:

The following code can be found in the Numpy_Vs_Torch_object_computation_speed_comparison.ipynb file in the Chapter02 folder of this book’s GitHub repository.

1. Generate two different torch objects:
    import torch
    x = torch.rand(1, 6400)
    y = torch.rand(6400, 5000)
2. Define the device on which we will store the tensor objects we created in step 1:
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
   Note that if you don’t have a GPU device, the device will be cpu (furthermore, you would not notice the drastic difference in time taken to execute when using a CPU).
3. Register the tensor objects that were created in step 1 with the device (registering tensor objects means storing information in a device):
    x, y = x.to(device), y.to(device)
4. Perform matrix multiplication of the Torch objects, and also time it so that we can compare the speed to a scenario where matrix multiplication is performed on NumPy arrays:
    %timeit z=(x@y)
    # It takes 0.515 milliseconds on average to
    # perform matrix multiplication
5. Perform matrix multiplication of the same tensors on cpu:
    x, y = x.cpu(), y.cpu()
    %timeit z=(x@y)
    # It takes 9 milliseconds on average to
    # perform matrix multiplication
6. Perform the same matrix multiplication, this time on NumPy arrays:
    import numpy as np
    x = np.random.random((1, 6400))
    y = np.random.random((6400, 5000))
    %timeit z = np.matmul(x,y)
    # It takes 19 milliseconds on average to
    # perform matrix multiplication

You will notice that the matrix multiplication performed on Torch objects on a GPU is ~18X faster than Torch objects on a CPU, and ~40X faster than the matrix multiplication performed on NumPy arrays. In general, matmul with Torch tensors on a CPU is still faster than NumPy. Note that you will notice this kind of speed increase only if you have a GPU device. If you are working on a CPU device, you will not notice the dramatic increase in speed. This is why if you do not own a GPU, we recommend using Google Colab notebooks, as the service provides free GPUs.
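The %timeit lines in the excerpt are IPython magics, so they only run in a notebook. As a rough sketch of the same comparison in a plain script, the standard timeit module can be used instead. This is not the book's code: the helper names mean_ms, torch_matmul_ms, and numpy_matmul_ms are hypothetical, and the numbers you get will depend on your hardware.

```python
import timeit

import numpy as np
import torch


def mean_ms(fn, repeats=20):
    """Average wall-clock time of fn() in milliseconds over `repeats` calls."""
    fn()  # warm-up call so one-time setup cost is not measured
    return timeit.timeit(fn, number=repeats) / repeats * 1000


def torch_matmul_ms(device, rows=6400, cols=5000):
    """Time x @ y for Torch tensors created directly on `device`."""
    x = torch.rand(1, rows, device=device)
    y = torch.rand(rows, cols, device=device)

    def run():
        z = x @ y
        if device == "cuda":
            # CUDA kernels launch asynchronously; wait until the result exists.
            torch.cuda.synchronize()
        return z

    return mean_ms(run)


def numpy_matmul_ms(rows=6400, cols=5000):
    """Time np.matmul for arrays of the same shapes."""
    x = np.random.random((1, rows))
    y = np.random.random((rows, cols))
    return mean_ms(lambda: np.matmul(x, y))


results = {
    "numpy (cpu)": numpy_matmul_ms(),
    "torch (cpu)": torch_matmul_ms("cpu"),
}
if torch.cuda.is_available():
    results["torch (cuda)"] = torch_matmul_ms("cuda")

for name, ms in results.items():
    print(f"{name}: {ms:.3f} ms")
```

Two details are worth keeping in mind when reading the output. The explicit torch.cuda.synchronize() call makes the GPU timing cover the whole computation rather than just the kernel launch. Also, torch.rand produces float32 values by default while np.random.random produces float64, so part of any gap between the CPU results may come from the data type rather than from the library.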

Packt library subscribers can continue reading the entire book for free. You can buy Modern Computer Vision with PyTorch - Second Edition, by V Kishore Ayyadevara and Yeshwanth Reddy, here.

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or leave a comment below!
