PythonPro #56: Pandas Vectorized Operations, PyPI Deletion Rules, and ChatGPT vs. Gemini Accuracy Showdown
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book, Pandas Cookbook - Third Edition, which emphasizes the importance of using vectorized operations in pandas for better performance compared to Python loops.
News Highlights: Technion launches PyPIM for in-memory computing in Python; PEP 763 limits PyPI deletions to 72 hours post-upload; and ColiVara API enables advanced document retrieval with visual embeddings.
My top 5 picks from today’s learning resources:
Flash Attention derived and coded from first principles with Triton (Python)⚡
Mastering Bivariate Maps with Plotly: A Step-by-Step Guide🗺️
5 Overrated Python Libraries (And What You Should Use Instead)🔄
And today’s Featured Study evaluates the AI programming tools ChatGPT, Gemini, AlphaCode, and GitHub Copilot, highlighting ChatGPT's leading single-attempt accuracy (87.2% pass@1) and Gemini's strong multi-attempt performance.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Researchers develop Python code for in-memory computing: Technion researchers have developed PyPIM, a tool that translates Python code into machine code for in-memory computing, enabling programmers to use Python without adaptation.
PEP 763 – Limiting deletions on PyPI: PEP 763 proposes restricting the deletion of files, releases, and projects on PyPI to within 72 hours of upload, with exceptions for pre-release specifiers.
ColiVara – State of the Art RAG API with Vision Models: ColiVara is a Python-based API and suite of tools for state-of-the-art document retrieval using visual embeddings, designed as a web-first implementation of the ColPali paper.
💼Case Studies and Experiments🔬
Any Python program fits in 24 characters*: Demonstrates how to encode any Python program in 24 characters (excluding whitespace) by exploiting the flexibility of whitespace encoding and Unicode representations.
Judge a Book by its Color: How to Visualize Decades of Book Cover Colors from Scratch — Scraping, Data, and Design: Explores six decades of bestseller book cover colors using web scraping, ColorThief, and other libraries.
📊Analysis
A pocket calculator using lambdatalk vs. Python: Compares building a browser-based pocket calculator using the lightweight functional programming language lambdatalk with the Python-to-JavaScript transpiler Brython.
Building a macOS app with python - BeeWare vs Kivy vs Flet vs Swift: Compares Python GUI frameworks BeeWare, Kivy, and Flet with Swift for building a macOS voice cloning app using the F5 TTS MLX model.
🎓 Tutorials and Guides 🤓
📽️Flash Attention derived and coded from first principles with Triton (Python): Provides a comprehensive tutorial on deriving and coding Flash Attention from scratch, covering mathematical foundations, CUDA, and Triton.
Mastering Bivariate Maps with Plotly: A Step-by-Step Guide: Covers data generation, normalization, creating custom legends, and interactive map visualization, offering insights into crafting informative and visually appealing geospatial representations.
1969: Can You Land on The Moon? • A Python `turtle` Lunar Lander: Demonstrates how to create a lunar landing game using Python’s turtle module, simulating realistic physics and controls for landing a lunar module.
Generating realistic IoT data using Python & storing into MongoDB Timeseries Collection. Part 1: Guides you through generating realistic IoT sensor data streams using Python and storing them in MongoDB Time Series Collections.
Vector animations with Python: A notebook demonstrating how to create dynamic vector animations in Python using Gizeh for vector graphics and MoviePy for animation.
Dependent Types in 200 Lines of Python: Demonstrates building a type checker for the Calculus of Constructions (CoC) in Python, illustrating dependent types, type polymorphism, and precise type guarantees.
Data in the Small: Python package littletable combines in-memory NoSQL ORM with schemaless setup (and easy CSV import/export): Introduces littletable, a lightweight Python package for in-memory NoSQL ORM with a schema-less setup, offering SQL-like features.
🔑Best Practices and Advice🔏
5 Overrated Python Libraries (And What You Should Use Instead): Critiques Requests, BeautifulSoup, Pandas, Matplotlib, and Scikit-Learn as outdated or inefficient for modern needs, and suggests alternatives.
Python Dictionary Comprehensions: How and When to Use Them: Covers creating dictionaries from iterables, transforming existing ones, and filtering key-value pairs with conditionals, while also advising on best practices.
Using the Python zip() Function for Parallel Iteration: Covers key concepts such as zip()'s lazy evaluation, handling unequal-length iterables, and using zip() to build dictionaries, alongside techniques like unzipping sequences.
Using the len() Function in Python: Delves into applying len() to built-in and third-party types, like NumPy arrays and pandas DataFrames, as well as extending its functionality to user-defined classes via the .__len__() method.
Attempts at immutability with dataclasses in Python: Explores achieving immutability in Python through various methods, comparing old-style constants, new-style constants, dataclasses, enums, namedtuples, and metaprogramming.
🔍Featured Study: Programming with AI💥
In "Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers," Siam, Gu, and Cheng, compare four leading AI-powered tools for programming. The researchers from the New York Institute of Technology, aim to assess the tools' code-generation accuracy, capabilities, and implications for developers using rigorous benchmarks and evaluation metrics.
Context
LLMs like ChatGPT, Gemini, AlphaCode, and GitHub Copilot use transformer architectures to process natural language and generate programming code. Tools such as these are revolutionising software development by automating code creation and assisting with problem-solving tasks. The study’s relevance lies in its comprehensive evaluation of their accuracy, efficiency, and potential to transform programming workflows. Metrics like pass@k (accuracy over k attempts) and test case pass rates (functional correctness) provide critical insight into the models' capabilities.
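For readers unfamiliar with pass@k, here is a minimal sketch (ours, not from the paper) of the standard unbiased estimator used in this line of evaluation work: given n generated samples for a problem, of which c pass all tests, it estimates the probability that at least one of k sampled solutions is correct. The function name pass_at_k is our own.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.
    n: samples generated, c: samples that pass all tests, k: attempts allowed."""
    if n - c < k:
        return 1.0  # every k-sample draw must contain at least one correct solution
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 30 of which pass the hidden tests
print(round(pass_at_k(n=200, c=30, k=1), 3))    # 0.15
print(round(pass_at_k(n=200, c=30, k=100), 3))  # close to 1.0 with many attempts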
Key Findings
ChatGPT: GPT-4-Turbo-0125 achieved the highest accuracy (87.2% pass@1) on HumanEval, outperforming other models in single-attempt code generation.
Gemini: Gemini-1.5-Pro scored 74.9% on HumanEval, while Gemini-Ultra excelled in multiple-attempt scenarios with a 74.7% pass@100 on Natural2Code.
AlphaCode: Designed for competitive programming, AlphaCode achieved pass rates of 54% (Python), 51% (Java), and 45% (C++) on Codeforces challenges.
GitHub Copilot: On LeetCode, Copilot attained test case pass rates of 75.7% (Java) and 73.3% (C++), enhancing productivity by offering real-time code suggestions.
Ethical Issues: Models exhibit biases in outputs, risk copyright infringement, and occasionally produce plausible but incorrect code. GitHub Copilot, in particular, has faced criticism over intellectual property concerns.
What This Means for You
The study is particularly valuable for programmers, software engineers, and organisations using AI tools to streamline coding tasks. It highlights which tools excel in accuracy and productivity, enabling developers to make informed decisions based on their specific needs, such as competitive programming (AlphaCode) or real-time coding assistance (GitHub Copilot). Ethical concerns warrant careful oversight when using these tools in professional environments.
Examining the Details
The study uses empirical methods, analysing performance across benchmarks like HumanEval, Codeforces, and Natural2Code. Metrics such as pass@1, pass@100, and test case pass rates were applied to ensure rigorous evaluation. By referencing 10 recent research papers, it validates the models' capabilities and relevance. However, the study also emphasises limitations, including computational costs and the need for human oversight due to occasional inaccuracies. Despite these challenges, the findings are robust, demonstrating how AI tools are reshaping the future of programming.
You can learn more by reading the entire paper.
🧠 Expert insight💥
Here’s an excerpt from “Chapter 10: General Usage and Performance Tips” in the Pandas Cookbook - Third Edition by William Ayd and Matthew Harrison, published in October 2024.
Use vectorized functions instead of loops
Python as a language is celebrated for its looping prowess. Whether you are working with a list or a dictionary, looping over an object in Python is a relatively easy task to perform, and can allow you to write really clean, concise code.
Even though pandas is a Python library, those same looping constructs are ironically an impediment to writing idiomatic, performant code. In contrast to looping, pandas offers vectorized computations, i.e., computations that work with all of the elements contained within a pd.Series but which do not require you to explicitly loop.
How to do it
Let’s start with a simple pd.Series constructed from a range:
import pandas as pd

ser = pd.Series(range(100_000), dtype=pd.Int64Dtype())
We could use the built-in pd.Series.sum method to easily calculate the summation:
ser.sum()
4999950000
Looping over the pd.Series and accumulating your own result will yield the same number:
result = 0
for x in ser:
    result += x
result
4999950000
Yet the two code samples are nothing alike. With pd.Series.sum, pandas performs the summation of elements in a lower-level language like C, avoiding any interaction with the Python runtime. In pandas speak, we would refer to this as a vectorized function.
By contrast, the for loop is handled by the Python runtime, and as you may or may not be aware, Python is a much slower language than C.
To put some tangible numbers forth, we can run a simple timing benchmark using Python’s timeit module. Let’s start with pd.Series.sum:
import timeit

timeit.timeit(ser.sum, number=1000)
0.04479526499926578
Let’s compare that to the Python loop:
def loop_sum():
    result = 0
    for x in ser:
        result += x

timeit.timeit(loop_sum, number=1000)
5.392715779991704
That’s a huge slowdown with the loop!
Generally, you should look to use the built-in vectorized functions of pandas for most of your analysis needs. For more complex applications, reach for the .agg, .transform, .map, and .apply methods, which were covered back in Chapter 5, Algorithms and How to Apply Them. You should be able to avoid using for loops in 99.99% of your analyses; if you find yourself using them more often, you should rethink your design, more than likely after a thorough re-read of Chapter 5, Algorithms and How to Apply Them.
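As a quick illustration (our own sketch, not from the book), here is how a loop-built transformation compares with its vectorized and .map-based equivalents:
import pandas as pd

ser = pd.Series(range(100_000), dtype=pd.Int64Dtype())  # same series as above

# Loop-style construction: every element passes through the Python runtime
squared_loop = pd.Series([x ** 2 for x in ser], dtype=pd.Int64Dtype())

# Vectorized expression: the whole operation stays in pandas’ fast path
squared_vec = ser ** 2

# .map is still element-wise Python, but it is the idiomatic escape hatch
# for logic that has no vectorized equivalent
parity = ser.map(lambda x: "even" if x % 2 == 0 else "odd")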
The one exception to this rule where it may make sense to use a for loop is when dealing with a pd.GroupBy object, which can be efficiently iterated like a dictionary:
df = pd.DataFrame({
"column": ["a", "a", "b", "a", "b"],
"value": [0, 1, 2, 4, 8],
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")
for label, group in df.groupby("column"):
print(f"The group for label {label} is:\n{group}\n")
The group for label a is:
  column  value
0      a      0
1      a      1
3      a      4

The group for label b is:
  column  value
2      b      2
4      b      8
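As a side note from us (not the book): when all you need is a per-group aggregate rather than the group frames themselves, the vectorized path still applies and no explicit loop is required:
# Vectorized per-group aggregation on the same df as above
df.groupby("column")["value"].sum()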
Pandas Cookbook - Third Edition was published in October 2024.
Get the eBook for $27.98 (regularly $39.99)
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or leave a comment below!