PythonPro #1: Security and Immortal Objects in Python 3.12, Performance Optimization, and Metaclasses
Bite-sized actionable content, practical tutorials, and resources for Python programmers
“You primarily write your code to communicate with other coders, and, to a lesser extent, to impose your will on the computer.”
— Guido van Rossum (2019), The Mind at Work: Guido van Rossum on how Python makes thinking in code easier
Welcome to the very first issue of our Python-focused newsletter, featuring relevant news and useful learning resources to sharpen your Python skills.
With Python 3.12.0 released last week, it is the perfect time to go beyond the feature update listings and dig deeper into the new release, from its security implications to fascinating details like immortal objects.
We also bring you a fresh roundup of tutorials, guides, and code optimization strategies from around the web, including:
Advanced Python: Performance Optimization - 10 Common Scenarios and Strategies
Metaclasses in Python – A brief introduction to Python class object and how it is created
And to round it all up, we have an exclusive excerpt from the book Python Real-World Projects teaching you how to scrape data from webpages. So, without further ado, dive right in!
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: Thank you to everyone who participated in our discovery surveys. We want to continue to listen to you and understand how we can make PythonPro more tailored to your needs each week. So do take part in the survey at the end of the newsletter and as a thank you, you can download a free PDF of Expert Python Programming - 4th Edition when you finish.
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Hundreds of malicious Python packages found stealing sensitive data: Malicious actors are planting data-stealing packages on open-source platforms, affecting thousands of downloads. Read to learn more about their tactics, ranging from complex obfuscation techniques to app manipulation, and explore the list of malicious packages used in the campaign for added vigilance.
Microsoft drops official support for Python 3.7 in Visual Studio Code: Despite the version’s ongoing popularity, relying on it unofficially (which is possible) may carry risks. Python 3.8 and the latest 3.12 are alternatives, but Python 3.8's support will end in 2025 too. Read to learn more if you are considering upgrading.
💼Case Studies and Experiments🔬
Python Type Hints: pyastgrep case study: This article delves into the experiences of incorporating static typing into a real-world Python project. Read for a candid exploration of the intricacies involved from configuring mypy to handling third-party types, duck typing, and more.
Profiling Python and Ruby using eBPF: This article introduces Python support in Parca Agent for facilitating performance profiling. Read to understand the intricacies of profiling interpreted Python code, the need to navigate runtimes and abstract stacks, and how to read abstract stack data from in-memory structures within Python runtimes.
📊Analysis
Python 3.12.0 from a supply chain security perspective: This article focuses on the supply chain security improvements in the new Python 3.12.0, including features like building source tarballs from a public commit for verifiability. The author also discusses the use of SBOMs to track changes in subcomponents between Python 3.11.6 and 3.12.0. Read to learn more about what these changes include and how they contribute to improved security and reliability in Python.
Understanding Immortal Objects in Python 3.12: A Deep Dive into Python Internals: In CPython, certain objects like None, True, and False are treated as global singletons, but they have faced performance issues due to reference count updates. Immortalization, introduced in Python 3.12, marks these objects as 'immortal', preventing reference count changes and boosting performance. Read to learn more about the intricacies of Python's interpreter and its impact on performance and concurrency.
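If you want to see immortalization for yourself, here is a minimal, hypothetical sketch using the standard library's sys.getrefcount(); on a 3.12 interpreter the reported count for None stays pinned at a large sentinel value no matter how many references you create, whereas earlier versions show it growing.

import sys

# Immortal objects keep a fixed reference count in Python 3.12 (PEP 683).
before = sys.getrefcount(None)
extra_references = [None] * 1_000  # would bump the count on Python 3.11 and earlier
after = sys.getrefcount(None)
print(before, after)  # on 3.12 both values are the same large sentinel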
🎓 Tutorials and Guides 🤓
Metaclasses in Python – A brief introduction to Python class object and how it is created: This guide talks about how class objects are created, the role of the __new__ and __init__ methods, and how to make classes callable with __call__. Read to discover the inner workings of metaclasses by explicitly specifying them; a short illustrative sketch follows at the end of this section.
Fine-Tuning Mistral7B on Python Code With A Single GPU!: Discover how to fine-tune Mistral 7B for Python code on a single GPU. Learn about its efficiency compared to larger models and explore tools like HuggingFace's Transformers library, DeepSpeed, and Choline. Read for stepwise instructions starting with data selection to training and inference.
Calling Rust from Python: Whether you seek performance improvements or need to integrate Rust into your Python workflow, this article presents three practical approaches: HTTP, Inter-Process Communication (IPC), and Foreign Function Interface (FFI). Read to understand these techniques and their trade-offs and access the complete source code on GitHub.
Create a custom user model in Django: This step-by-step guide, with access to the complete code on GitHub, will help you unlock the power of Django's custom user model to enhance authentication, streamline user information, and improve security. Read to learn how to create a custom user model in Django, use email as the unique identifier, create a custom user model manager, and more.
Simplify legend handling in Python: This article takes you through the process of simplifying complex legends and creating classed choropleth maps in geopandas. Read if you want to learn how to customize legend glyphs and create more appealing data visualizations.
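As a companion to the metaclasses guide above, here is a minimal, illustrative sketch (the class names are invented for the example) of an explicitly specified metaclass: __new__ builds the class object when the class statement runs, and __call__ runs each time the class itself is called to create an instance, before __init__ initializes it.

class Meta(type):
    def __new__(mcls, name, bases, namespace):
        # Runs when the class statement below is executed; builds the class object.
        print(f"creating class {name}")
        return super().__new__(mcls, name, bases, namespace)

    def __call__(cls, *args, **kwargs):
        # Runs when the class itself is called, before __init__ runs on the instance.
        print(f"instantiating {cls.__name__}")
        return super().__call__(*args, **kwargs)

class Point(metaclass=Meta):
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)  # prints "instantiating Point"; class creation was printed earlier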
🔑 Best Practices and Code Optimization 🔏
Advanced Python: Performance Optimization - 10 Common Scenarios and Strategies: This article will help you learn key strategies for optimizing Python performance, from efficient algorithms to memory management. Read if you want to know how to enhance I/O operations, optimize API requests, and improve data serialization to create high-performing Python apps.
Django things you want with HTMX: This article highlights some key aspects and tips for effectively using HTMX with Django. It covers topics like using HTTP 303 for redirections, handling Django error response pages with Ajax, triggering client-side events from the backend, and more. Read if you want to explore the world of Django with HTMX and discover the valuable enhancements it can bring to your web development projects.
5 Ways to Measure Execution Time in Python: Learn how to measure execution time in Python with 5 methods: time.time(), time.perf_counter(), time.monotonic(), time.process_time(), and time.thread_time(). Read to understand the different precision levels each method offers and to be able to choose the right one for your benchmarking needs; a short comparison sketch follows at the end of this section.
Thread Safety in Python: Thread safety in Python is crucial to avoid race conditions, where a program's behavior depends on the timing of events. Python's Global Interpreter Lock (GIL) restricts true parallel execution, but race conditions can still occur. Read if you are working on concurrent and multithreaded applications and want to prevent data corruption and bugs caused by race conditions.
Learning Python and PsychoPy by writing games: This book, made available under the CC BY-NC-ND 4.0 DEED license, aims to teach programming, particularly to social and experimental psychology students. But if you are willing to give it a go, you will find a wealth of coding best practices and practical exercises to improve your general programming skills. Read if you want to be able to create sophisticated experiments using core Python concepts, control structures, object-oriented programming, exceptions, and more. The book also introduces PsychoPy, a library for psychophysical experiments.
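As promised above, here is a small, hypothetical timing sketch contrasting two of the five methods from the execution-time article: time.perf_counter() measures wall-clock time at high resolution, while time.process_time() counts only the CPU time consumed by the current process.

import time

def work():
    return sum(i * i for i in range(100_000))

start_wall = time.perf_counter()
start_cpu = time.process_time()
work()
# perf_counter() includes time spent sleeping or waiting; process_time() does not.
print(f"wall clock: {time.perf_counter() - start_wall:.6f} s")
print(f"cpu time:   {time.process_time() - start_cpu:.6f} s")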
🧠 Expert insights from the Packt Community 📚
Here’s an exclusive excerpt from “Chapter 4, Data Acquisition Features: Web APIs and Scraping” in the book Python Real-World Projects by Steven F. Lott.
4.2 Project 1.3: Scrape data from a web page
In some cases, we want data that’s provided by a website that doesn’t have a tidy API. The data is available via an HTML page. This means the data is surrounded by HTML markup, text that describes the semantics or structure of the data.
We’ll start with a description of the application, and then move on to talk about the architectural approach. This will be followed with a detailed list of deliverables.
4.2.1 Description
We’ll continue to describe projects designed to acquire some data for further analysis. In this case, we’ll look at data that is available from a website, but is embedded into the surrounding HTML markup. We’ll continue to focus on Anscombe’s Quartet data set because it’s small and diagnosing problems is relatively simple. A larger data set introduces additional problems with time and storage.
Parts of this application are an extension to the project in Project 1.2: Acquire data from a web service. The essential behavior of this application will be similar to the previous project. This project will use a CLI application to grab data from a source.
The User Experience (UX) will also be a command-line application with options to fine-tune the data being gathered. Our expected command line should look something like the following:
% python src/acquire.py -o quartet --page "https://en.wikipedia.org/wiki/Anscombe’s_quartet" --caption "Anscombe’s quartet"
The -o quartet argument specifies a directory into which four results are written. These will have names like quartet/series_1.json.
The table is buried in the HTML of the URL given by the --page argument. Within this HTML, the target table has a unique <caption> tag: <caption>Anscombe’s quartet</caption>.
4.2.2 About the source data
This data embedded in HTML markup is generally marked up with the <table> tag. A table will often have the following markup:
<table class="wikitable" style="text-align: center; margin-left:auto;
margin-right:auto;" border="1">
<caption>Anscombe’s quartet</caption>
<tbody>
<tr>
<th colspan="2">I</th>
etc.
</tr>
<tr>
<td><i>x</i></td>
<td><i>y</i></td>
etc.
</tr>
<tr>
<td>10.0</td>
<td>8.04</td>
etc.
</tr>
</tbody>
</table>
In this example, the overall <table> tag will have two child tags, a <caption> and a <tbody>.
The table’s body, within <tbody>, has a number of rows wrapped in <tr> tags. The first row has headings in <th> tags. The second row also has headings, but they use the <td> tags. The remaining rows have data, also in <td> tags.
This structure has a great deal of regularity, making it possible to use a parser like Beautiful Soup to locate the content.
…This section has looked at the input and processing for this application. The output will match earlier projects. In the next section, we’ll look at the overall architecture of the software.
4.2.3 Approach
We’ll take some guidance from the C4 model when looking at our approach:
Context: For this project, a context diagram would show a user extracting data from a source. You may find it helpful to draw this diagram.
Containers: One container is the user’s personal computer. The other container is the Wikipedia website, which provides the data.
Components: We’ll address the components below.
Code: We’ll touch on this to provide some suggested directions.
It’s important to consider this application as an extension to the project in Chapter 3, Project 1.1: Data Acquisition Base Application. The base level of architectural design is provided in that chapter.
In this project, we’ll be adding a new html_extract module to capture and parse the data. The overall application in the acquire module will change to use the new features. The other modules should remain unchanged.
A new architecture that handles the download of HTML data and the extraction of a table from the source data is shown in Figure 4.4.
Figure 4.4: Revised Component Design
This diagram suggests classes for the new html_extract module. The Download class uses urllib.request to open the given URL and read the contents. It also uses the bs4 module (Beautiful Soup) to parse the HTML, locate the table with the desired caption, and extract the body of the table.
The PairBuilder class hierarchy has four implementations, each appropriate for one of the four data series. Looking back at Chapter 3, Project 1.1: Data Acquisition Base Application, there’s a profound difference between the table of data shown on the Wikipedia page, and the CSV source file shown in that earlier project. This difference in data organization requires slightly different pair-building functions.
Making an HTML request with urllib.request
The process of reading a web page is directly supported by the urllib.request module. The urlopen() function will perform a GET request for a given URL. The return value is a file-like object, with a read() method, that can be used to acquire the content.
This is considerably simpler than making a general RESTful API request, where there are a variety of pieces of information to be uploaded and a variety of kinds of results that might be downloaded. When working with common GET requests, this standard library module handles the ordinary processing elegantly.
A suggested design for the first step in the operation is the following function:
from urllib.request import urlopen
from bs4 import BeautifulSoup, Tag

def get_page(url: str) -> BeautifulSoup:
    return BeautifulSoup(
        urlopen(url), "html.parser"
    )
The urlopen() function will open the URL as a file-like object, and provide that file to the BeautifulSoup class to parse the resulting HTML.
A try: statement to handle potential problems is not shown. There are innumerable potential issues when reaching out to a web service, and trying to parse the resulting content. You are encouraged to add some simple error reporting.
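As an illustration only (the exception handling here is a suggestion, not the book's design), the get_page() function might report failures like this:

from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup

def get_page(url: str) -> BeautifulSoup:
    try:
        return BeautifulSoup(urlopen(url), "html.parser")
    except URLError as ex:
        # Simple error reporting; a real application might log and retry instead.
        raise RuntimeError(f"could not read {url!r}: {ex}") from ex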
In the next section, we’ll look at extracting the relevant table from the parsed HTML.
HTML scraping and Beautiful Soup
The Beautiful Soup data structure has a find_all() method to traverse the structure. This will look for tags with specific kinds of properties. This can examine the tag, the attributes, and even the text content of the tag.
…In this case, we need to find a <table> tag with a caption tag embedded within it. That caption tag must have the desired text. This search leads to a bit more complex investigation of the structure. The following function can locate the desired table.
def find_table_caption(
    soup: BeautifulSoup,
    caption_text: str = "Anscombe’s quartet"
) -> Tag:
    for table in soup.find_all('table'):
        if table.caption:
            if table.caption.text.strip() == caption_text.strip():
                return table
    raise RuntimeError(f"<table> with caption {caption_text!r} not found")
Some of the tables lack captions. This means the expression table.caption.text won’t work for string comparison because table.caption may have a None value. This leads to a nested cascade of if statements to make sure there’s a <caption> tag before checking the text value of the tag.
The strip() functions are used to remove leading and trailing whitespace from the text because blocks of text in HTML can be surrounded by whitespace that’s not displayed, making it surprising when it surfaces as part of the content. Stripping the leading and trailing whitespace makes it easier to match.
The rest of the processing is left for you to design. This processing involves finding all of the <tr> tags, representing rows of the table. Within each row (except the first), there will be a sequence of <td> tags representing the cell values within the row.
Once the text has been extracted, it’s very similar to the results from a csv.reader.
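One possible shape for this row extraction, shown here only as a sketch, is a generator like the table_row_data_iter() function used in the test case later in this section; the implementation below is an assumption about how it might look, not the book's own code.

from collections.abc import Iterator
from bs4 import Tag

def table_row_data_iter(table: Tag) -> Iterator[list[str]]:
    for row in table.find_all("tr"):
        # Heading rows use <th> tags, so they yield an empty list of <td> values.
        yield [cell.text.strip() for cell in row.find_all("td")]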
After considering the technical approach, it’s time to look at the deliverables for this project.
4.2.4 Deliverables
This project has the following deliverables:
Documentation in the docs folder.
Acceptance tests in the tests/features and tests/steps folders.
Unit tests for the application modules in the tests folder.
Mock HTML pages for unit testing will be part of the unit tests.
Application to acquire data from an HTML page.
We’ll look at a few of these deliverables in a little more detail.
Unit test for the html_extract module
The urlopen() function supports the http: and https: schemes. It also supports the file: protocol. This allows a test case to use a URL of the form file:///path/to/a/file.html to read a local HTML file. This facilitates testing by avoiding the complications of accessing data over the internet.
For testing, it makes sense to prepare files with the expected HTML structure, as well as invalid structures. With some local files as examples, a developer can run test cases quickly.
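As a quick, hypothetical illustration (the path shown is made up), the same urlopen() call used for web pages can read such a local file:

from urllib.request import urlopen

# file: URLs let a test read a fixture from disk without any network access.
content = urlopen("file:///tmp/example_1.html").read()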
Generally, it’s considered a best practice to mock the BeautifulSoup class. A fixture would respond to the various find_all() requests with mock tag objects.
When working with HTML, however, it seems better to provide mock HTML. The wide variety of HTML seen in the wild suggests that time spent with real HTML is immensely valuable for debugging.
Creating BeautifulSoup objects means the unit testing is more like integration testing. The benefits of being able to test a wide variety of odd and unusual HTML seems to be more valuable than the cost of breaking the ideal context for a unit test.
Having example HTML files plays well with the way pytest fixtures work. A fixture can create a file and return the path to the file in the form of a URL. After the test, the fixture can remove the file.
A fixture with a test HTML page might look like this:
from pytest import fixture
from textwrap import dedent

@fixture
def example_1(tmp_path):
    html_page = tmp_path / "works.html"
    html_page.write_text(
        dedent("""\
            <!DOCTYPE html>
            <html>
            etc.
            </html>
            """
        )
    )
    yield f"file://{str(html_page)}"
    html_page.unlink()
This fixture uses the tmp_path fixture to provide access to a temporary directory used only for this test. The file, works.html, is created and filled with an HTML page. The test page should include multiple <table> tags, only one of which has the expected <caption> tag.
The dedent() function is a handy way to provide a long string that matches the prevailing Python indent. The function removes the indenting whitespace from each line; the resulting text object is not indented.
The return value from this fixture is a URL that can be used by the urlopen() function to open and read this file. After the test is completed, the final step (after the yield statement) will remove the file.
A test case might look something like the following:
def test_steps(example_1):
    soup = html_extract.get_page(example_1)
    table_tag = html_extract.find_table_caption(soup, "Anscombe’s quartet")
    rows = list(html_extract.table_row_data_iter(table_tag))
    assert rows == [
        [],
        ['Keep this', 'Data'],
        ['And this', 'Data'],
    ]
The test case uses the example_1 fixture to create a file and return a URL referring to the file. The URL is provided to a function being tested. The functions within the html_extract module are used to parse the HTML, locate the target table, and extract the individual rows.
The return value tells us the functions work properly together to locate and extract data. You are encouraged to work out the necessary HTML for good — and bad — examples.
Python Real-World Projects by Steven F. Lott was released in September 2023. Read the second chapter available for free and buy the book here, or sign up for a Packt subscription to access the complete book and the entire Packt digital library. To explore more, click on the button below.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. If you have any comments or feedback, take the survey.
See you next week!