PythonPro #50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips

Oct 10, 2024

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight, we bring you an excerpt from the recently published book Offensive Security Using Python, which briefly discusses key practices such as input validation, secure authentication, session management, secure coding techniques, and the implementation of security headers.

News Highlights: Python 3.13.0, released on October 7, adds a new interactive interpreter, an experimental free-threaded mode, an experimental JIT compiler, and iOS/Android support; and Rev's Reverb models for ASR and diarization outperform other open-source models.

Here are my top 5 picks from our learning resources today:

And today’s Featured Study evaluates the performance of AI models in geospatial code generation, revealing significant challenges in handling complex tasks, specific data formats, and specialised libraries.

Stay awesome!

Divya Anne Selvaraj

Editor-in-Chief

P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.


🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Python 3.13.0 Is Released: Released on October 7, 2024, this version includes a new interactive interpreter, a free-threaded mode, a JIT compiler, and support for the iOS and Android platforms.
Introducing Reverb: The Future of Open-Source automatic speech recognition (ASR) and Diarization: Rev's new open-source models for ASR and speech diarization, built using Rev’s extensive human-transcribed English speech dataset, outperform existing open-source models.

💼Case Studies and Experiments🔬

Using Kolmogorov-Arnold Networks (KAN) and Backtesting to Predict Stock Prices: Discusses predicting stock prices, focusing on deep learning models trained on historical data from Yahoo Finance.
🎥Marketing Media Mix Models with Python & PyMC: a Case Study [PyCon DE & PyData Berlin 2024]: Discusses how machine learning models can optimize marketing investments by analyzing various channels.

📊Analysis

10 Jupyter Notebook Features You Didn’t Know Exist: Discusses features including magic commands, interactive widgets, auto-reload for modules, in-notebook documentation, and collapsible headings.
I Used Claude.ai to Create a Discord Bot — Here’s What I Learned About the State of AI Code Writing: Discusses the author's experience using Claude to rapidly generate Python code for a bot that deletes old Discord messages.

🎓 Tutorials and Guides 🤓

A Guide to Modern Python String Formatting Tools: Explains how to format values, create custom format specifiers, and embed expressions in strings. Read to learn practical techniques for dynamic string manipulation.
DuckDB in Python in the Browser with Pyodide, PyScript, and JupyterLite: Shows you how to run DuckDB in Python within a browser environment and embed interactive Python environments in web pages.
Tutorial: Creating a Twitter (X) Bot using Python: Explains how to build and deploy a Python-based Twitter (X) bot that autonomously tweets updates, including progress graphs, using the X API.
Distilling python functions into LLM: Explains how to use the Instructor library to distill Python functions into a language model, enabling fine-tuning for function emulation using Pydantic type hints.
Getting Started with Powerful Data Tables in Your Python Web Apps: Demonstrates building a finance app that fetches stock data, displays it interactively, and includes features like sorting and graph visualization.
Modeling customers decisions in Python with the Choice-Learn package: Introduces the Choice-Learn Python package, which simplifies implementing discrete choice models like Conditional Logit to predict customer decisions.
Optimizing Inventory Management with Reinforcement Learning: A Hands-on Python Guide: Outlines how Q-learning helps balance holding and stockout costs by developing an optimal ordering policy.

🔑Best Practices and Advice🔏

Speeding up CRC-32 calculations in Mojo: Discusses speeding up CRC-32 calculations in Mojo, achieving an 18x improvement over Python's native implementation while remaining 3x slower than the zlib library.
Bad Schemas could break your LLM Structured Outputs: Explains how choosing the right response model dramatically impacts the performance of language models like GPT-4o and Claude, especially when using JSON mode or Tool Calling.
Implementing a Python Singleton with Decorators: Explains how a decorator ensures only one instance of a class is created, using a _SingletonWrapper class to handle instantiation and simplify global access.
🎥Best practices for securely consuming open source in Python — Ciara Carey: Introduces a framework called Secure Supply Chain Consumption Framework (S2C2F) to help organizations improve open-source security.
Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog: Walks you through plotting data with a logarithmic x-axis, y-axis, and both axes, respectively, and provides code snippets to generate these plots.

🔍Featured Study: Current AI Models Fall Short in Geospatial Code Generation💥

In "Evaluation of Code LLMs on Geospatial Code Generation," Gramacki et al. introduce a benchmark to assess LLMs' ability to handle tasks involving spatial reasoning and data processing.

Context

LLMs generate code based on natural language inputs and are effective in general programming tasks, particularly in data science. Geospatial data science is a field focused on analysing spatial data tied to locations. It relies on libraries like GeoPandas and Shapely for tasks such as geo-coding, spatial analysis, and data visualisation. However, the domain poses unique challenges for LLMs due to the need for spatial reasoning and the use of specialised tools, making evaluation in this area crucial. As geospatial applications expand in industries such as urban planning and environmental science, reliable AI assistance is becoming increasingly important.

Key Findings

LLMs underperform in geospatial tasks: Models like Code Llama and Starcoder2 show reduced accuracy compared to their performance in general coding.
Starcoder2-7B leads but struggles: It achieved a pass@1 score of 32.47%, highlighting the difficulty of geospatial tasks even for top-performing models.
Complex tasks pose a challenge: Single-step tasks had a 45.45% pass@1 success rate, but multi-step tasks were far more difficult, scoring only 15.15%.
Data format matters: Models handled GeoDataFrames better than other formats like GeoJSON, showing varying levels of tool proficiency.
Limited tool support: Libraries like MovingPandas and OSMNX, crucial for geospatial analysis, were inadequately supported by the models.

What This Means for You

This study is relevant for geospatial programmers and data scientists seeking to automate coding tasks. Current LLMs are not yet reliable for complex geospatial tasks, highlighting a need for models specifically trained for the domain. Developers and researchers can benefit by focusing on improving AI models to better support geospatial data science workflows.

Examining the Details

The authors created a benchmark dataset categorising tasks by complexity, data format, and tool usage. The dataset includes 77 samples to test LLM performance on tasks like spatial reasoning and tool implementation. Evaluation metrics focused on accuracy and pass@1, with the results highlighting the models' struggles in handling geospatial problems. Libraries like GeoPandas and H3 were used to evaluate the models, while more complex tools like MovingPandas exposed the models' weaknesses.

This rigorous benchmark, publicly available for future research, sets a foundation for improving geospatial code generation in LLMs. The study’s methodology ensures it reflects real-world geospatial coding challenges, offering valuable insights for the development of more domain-specific AI tools.

You can learn more by reading the entire paper and accessing the benchmark dataset: geospatial-code-llms-dataset.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 3: An Introduction to Web Security with Python” in the book, Offensive Security Using Python by Rejah Rehim and Manindar Mohan, published in September 2024.

Proactive web security measures with Python

Python has become a versatile, widely used programming language in the field of modern software development. Its ease of use, readability, and rich library support have made it a popular choice for developing web-based applications in a variety of industries. Python frameworks such as Django, Flask, and Pyramid have enabled developers to create dynamic and feature-rich web applications with speed and agility.

However, as Python web apps become more popular, there is a corresponding increase in the sophistication and diversity of attacks targeting these applications. Cybersecurity breaches can jeopardize valuable user data, interfere with corporate operations, and damage an organization’s brand. Python web applications are exposed to a variety of security vulnerabilities, including SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF). The consequences of these vulnerabilities can be severe, demanding an effective cybersecurity strategy.

Developers must be proactive to counteract this. By implementing security practices such as input validation, output encoding, and other secure coding guidelines early in the development lifecycle, developers can reduce the attack surface and improve the resilience of their Python web applications.

Although we are only discussing Python-based applications here, these practices are universal and should be implemented in web applications built with any technology stack.

To protect against a wide range of cyber threats, it is critical to implement strong best practices. This section explains key security practices that developers should follow while developing web apps.

Input validation and data sanitization

User input validation is essential for preventing code injection attacks. Malicious inputs can exploit vulnerabilities and cause unwanted commands to be executed. Proper data sanitization ensures that user inputs are handled as data rather than executable code by eliminating or escaping special characters. Note that Python's built-in input() function and Flask’s request object only give access to incoming data; they return it unchecked, so the application must validate and sanitize it explicitly before use.
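As a minimal sketch of the idea (not taken from the book), the validators below accept only input that matches an explicit allow-list rule and reject everything else. The field names, the username pattern, and the age range are hypothetical choices made for illustration.

```python
import re

# Hypothetical rule: usernames are 3-20 letters, digits, or underscores.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,20}")


def validate_username(raw: str) -> str:
    """Return the trimmed username, or raise ValueError if it breaks the rule."""
    candidate = raw.strip()
    if not USERNAME_PATTERN.fullmatch(candidate):
        raise ValueError("invalid username")
    return candidate


def validate_age(raw: str) -> int:
    """Convert the raw string to an int and check a hypothetical 0-150 range."""
    try:
        age = int(raw)
    except ValueError:
        raise ValueError("age must be a whole number") from None
    if not 0 <= age <= 150:
        raise ValueError("age out of range")
    return age
```

Rejecting anything outside a known-good pattern is usually easier to reason about than trying to strip out every dangerous character.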

Secure authentication and authorization

Preventing unauthorized access requires effective authentication and authorization procedures. Password hashing with algorithms such as bcrypt or Argon2 adds an extra degree of security by ensuring that plaintext passwords are never stored. Two-factor authentication (2FA) adds an additional verification step to user authentication, increasing security. Role-Based Access Control (RBAC) allows developers to grant specific permissions to different user roles, ensuring that users only access functionality relevant to their responsibilities.
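The sketch below illustrates salted password hashing and verification. Because bcrypt and Argon2 come from third-party packages, this example swaps in PBKDF2-HMAC-SHA256 from Python's standard library as a stand-in; it is not the algorithm named in the excerpt. The iteration count and function names are illustrative assumptions.

```python
import hashlib
import hmac
import secrets

# Illustrative work factor (an assumption); tune it for your own hardware.
ITERATIONS = 600_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); only these are stored, never the plaintext."""
    salt = secrets.token_bytes(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)
```

In a real application, the same store-salt-and-digest pattern applies when using a dedicated bcrypt or Argon2 library, which typically handles the salt for you.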

Secure session management

Keeping user sessions secure is critical for avoiding session fixation and hijacking attempts. Using cookies with the HttpOnly and Secure attributes prohibits client-side script access and ensures that cookies are only sent over HTTPS. Session timeouts and measures such as session rotation can improve session security even further.
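To make the cookie attributes concrete, here is a small sketch (not from the book) that builds a Set-Cookie value with the standard library's http.cookies module. The cookie name session_id and the 30-minute lifetime are hypothetical choices.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical 30-minute session timeout.
SESSION_LIFETIME_SECONDS = 30 * 60


def build_session_cookie() -> tuple[str, str]:
    """Return a fresh random session ID and the matching Set-Cookie value."""
    session_id = secrets.token_urlsafe(32)  # unpredictable identifier
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    morsel = cookie["session_id"]
    morsel["httponly"] = True  # hide the cookie from client-side scripts
    morsel["secure"] = True    # send the cookie over HTTPS only
    morsel["max-age"] = SESSION_LIFETIME_SECONDS
    morsel["path"] = "/"
    return session_id, morsel.OutputString()
```

Calling build_session_cookie() again after login and discarding the old ID on the server side is one simple way to approximate session rotation.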

Secure coding practices

Following secure coding practices reduces a slew of possible vulnerabilities. Parameterized queries, supported by libraries such as sqlite3, protect against SQL injection by separating data from SQL commands. Output encoding, achieved with functions such as html.escape(), mitigates XSS threats by converting user inputs to innocuous text. Similarly, avoiding functions such as eval() and exec() prevents uncontrolled code execution, lowering the likelihood of code injection attacks.
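The three practices can be sketched with the standard library alone. The table layout, function names, and sample data below are hypothetical, and ast.literal_eval is shown as one possible alternative to eval() when the goal is only to parse a literal value; the excerpt itself does not name it.

```python
import ast
import html
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(name: str) -> list:
    """Parameterized query: the ? placeholder keeps input out of the SQL text."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def render_greeting(name: str) -> str:
    """Output encoding: escape user input before placing it in HTML."""
    return f"<p>Hello, {html.escape(name)}!</p>"


def parse_literal(text: str):
    """Parse Python literals only; function calls and names are rejected."""
    return ast.literal_eval(text)
```

With this setup, a classic injection string such as alice' OR '1'='1 is treated as a literal name and matches no rows, and a script tag passed to render_greeting() comes back as inert text.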

Implementing security headers

Security headers are a fundamental component of web application security. They are HTTP response headers that tell web browsers how to behave when interacting with the web application. Properly configured security headers can mitigate various web vulnerabilities, enhance privacy, and protect against common cyber threats.

Here is an in-depth explanation of implementing security headers to enhance web application security:

Content Security Policy (CSP): CSP is a security feature that helps prevent XSS attacks. By defining and specifying which resources (scripts, styles, images, etc.) can be loaded, CSP restricts script execution to trusted sources. Implementing CSP involves configuring the Content-Security-Policy HTTP header in your web server. This header helps prevent inline scripts and unauthorized script sources from being executed, reducing the risk of XSS attacks significantly. An example of the CSP header is as follows:

Content-Security-Policy: default-src 'self'; script-src 'self' www.google-analytics.com;

HTTP Strict Transport Security (HSTS): HSTS is a security feature that ensures secure, encrypted communication between the web browser and the server. It prevents Man-in-the-Middle (MITM) attacks by enforcing the use of HTTPS. Once a browser has visited a website with HSTS enabled, it will automatically establish a secure connection for all future visits, even if the user attempts to access the site via HTTP. An example HSTS header is as follows:

Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;

X-Content-Type-Options: The X-Content-Type-Options header prevents browsers from interpreting files as a media type, also known as a Multipurpose Internet Mail Extensions (MIME) type, other than the one the server declares. It mitigates attacks that exploit MIME sniffing, where an attacker tricks a browser into interpreting content in an unintended way, potentially leading to security vulnerabilities. An example X-Content-Type-Options header is as follows:

X-Content-Type-Options: nosniff

X-Frame-Options: The X-Frame-Options header prevents clickjacking attacks by denying the browser permission to display a web page in a frame or iframe. This header ensures that your web content cannot be embedded within malicious iframes, protecting against UI redressing attacks. An example X-Frame-Options header is as follows:

X-Frame-Options: DENY

Referrer-Policy: The Referrer-Policy header controls what information is included in the Referer request header (spelled with a single "r" in HTTP) when a user clicks on a link that leads to another page. By setting an appropriate referrer policy, you can protect sensitive information, enhance privacy, and reduce the risk of data leakage. An example Referrer-Policy header is as follows:

Referrer-Policy: strict-origin-when-cross-origin
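One way to apply the five headers above in Python is a small WSGI middleware that appends them to every response. This is a minimal sketch under stated assumptions, not the book's implementation: the names add_security_headers and hello_app are hypothetical, and the header values are copied from the examples above.

```python
# Header values copied from the examples in this section.
SECURITY_HEADERS = [
    ("Content-Security-Policy",
     "default-src 'self'; script-src 'self' www.google-analytics.com;"),
    ("Strict-Transport-Security", "max-age=31536000; includeSubDomains; preload;"),
    ("X-Content-Type-Options", "nosniff"),
    ("X-Frame-Options", "DENY"),
    ("Referrer-Policy", "strict-origin-when-cross-origin"),
]


def add_security_headers(app):
    """Wrap a WSGI app so each response carries the security headers."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Keep any header the app already set; add only the missing ones.
            existing = {name.lower() for name, _ in headers}
            extra = [(n, v) for n, v in SECURITY_HEADERS if n.lower() not in existing]
            return start_response(status, list(headers) + extra, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def hello_app(environ, start_response):
    """Hypothetical application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello, secure world!"]


secured_app = add_security_headers(hello_app)
```

The wrapped application can be served by any WSGI server, for example wsgiref.simple_server from the standard library during local testing. Frameworks such as Django and Flask offer their own hooks for setting response headers, which may be a better fit inside those frameworks.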

Packt library subscribers can continue reading the entire book for free. You can buy Offensive Security Using Python here.

Get the eBook for ~~$39.99~~ $27.98!

Get the Print Book for ~~$49.99~~ $34.98!

Other Python titles from Packt at 30% off

In-Memory Analytics with Apache Arrow - Second Edition