PythonPro #25: Enhancing ML Models, Python 3.12.3's Bug Fixes, Palo Alto's Python Backdoor, Streamlined Data Validation, and Logging
Bite-sized actionable content, practical tutorials, and resources for Python programmers and data scientists
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book, Active Machine Learning with Python, which talks about enhancing the accuracy and reliability of your ML models through improved data annotation practices such as training annotators, using control samples for ongoing skill assessments, and employing multiple annotators.
News Highlights: Python 3.12.3 brings over 300 bug fixes; hackers use Python backdoor in Palo Alto Zero-Day attack; pyaging introduces GPU-optimized aging clocks; and dpq enhances data processing with Gen AI.
Here are my top 5 picks from our learning resources today:
Dive in, and let me know what you think about this issue in today’s survey!
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S: If you have any food for thought, feedback, or would like us to find you a Python learning resource on a particular subject for our next issue, take the survey.
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Python 3.12.3 released: This is the third maintenance update for Python 3.12, featuring over 300 bug fixes, enhancements, and documentation updates. Read to learn about the update and future performance enhancements.
Hackers Deploy Python Backdoor in Palo Alto Zero-Day Attack: Dubbed Operation MidnightEclipse, attackers employed a Python backdoor, installing it via external command execution to maintain persistence and control. Read to learn more.
pyaging - a Python-based compendium of GPU-optimized aging clocks: This paper introduces pyaging, a Python package facilitating aging research by harmonizing diverse aging clocks. Read to learn how it enables fast inference with large datasets.
dpq enhances data processing and feature engineering with Gen AI: This Python library offers functions like sentiment classification through predefined or custom prompts managed via JSON configurations. Read if you work with prompt-based engineering tasks.
💼Case Studies and Experiments🔬
LLM-Powered Django Admin Fields: This article details integrating ChatGPT with Django Admin to streamline copywriting for product detail pages at Grove Collaborative. Read to learn how to enhance Django Admin fields with LLM-powered capabilities for dynamic content generation.
Reddit's Architecture - The Evolutionary Journey: Initially built in Lisp and later rewritten in Python, Reddit's journey includes transitioning from a monolithic application to a distributed architecture using AWS, microservices, and advanced data replication techniques. Read to gain insight into complex architectural transformations in high-traffic platforms.
📊Analysis
Using GitHub Copilot for Test Generation in Python - An Empirical Study: This study evaluates the effectiveness of GitHub Copilot in generating Python unit tests, both within existing test suites and independently. Read for insights into the practical usability and modifications needed for integrating GitHub Copilot into existing development workflows.
Trying out Rye: The article reflects on the author's exploration of Rye, a tool that aims to streamline Python project management, similar to Rust’s Cargo. Read for insights into whether Rye suits your project needs and your familiarity with Python environments.
🎓 Tutorials and Guides 🤓
Build a Blog Using Django, GraphQL, and Vue: This tutorial guides you through building a blog using Django as the backend, Vue.js for the frontend, and GraphQL for the API layer. Read to discover a comprehensive approach to building and managing a blog with separate backend and frontend services.
The Ultimate Guide to Python Logging Essentials: This article elucidates Python's logging module, detailing its components: records, handlers, formatters, and loggers. Read for illustrative code examples and insights for effective logging implementation.
Redis <-> RQScheduler <-> Celery for Dynamic Task Scheduling and Concurrent Execution in Django Backend: This article discusses a solution for high concurrency and dynamic scheduling challenges. Read to learn about the benefits of integrating Redis and RQScheduler for high-performance backend operations within Django applications.
PyScript is Growing Up: PyScript, a technology merging Python with web capabilities, has significantly evolved with its latest update, PyScript Next. Read to learn how to build a data visualization application using PyScript.
The Math Behind Deep CNN — AlexNet: This comprehensive guide covers AlexNet's architecture, including its layers and functions, and explains the model's use of ReLU activations, GPU acceleration, and dropout regularization. Read for insights into building and implementing the AlexNet model using Python, specifically with the PyTorch library.
Awesome Python Library - Tenacity: The Tenacity library in Python simplifies the implementation of retry logic for unreliable code blocks, such as those affected by external factors like network issues. Read to learn how Tenacity can streamline handling intermittent failures in Python scripts.
Bringing Lexical search to Python Pandas using SearchArray: This article introduces "SearchArray," explaining essential elements like tokenization, index creation, document matching, and scoring matches using algorithms like BM25. Read for code examples and references for further exploration.
🔑Best Practices, Advice, and Code Optimization🔏
The 2024 breaking into data engineering roadmap: This detailed roadmap emphasizes mastery of SQL and Python, understanding distributed compute systems like Spark or BigQuery, and skills in data orchestration and modeling. Read to identify the necessary skills and strategic approaches to secure a data engineering job.
"I'm One of You!" Said the Function to the Other Objects: This article explains how functions in Python are treated as first-class objects, meaning they can be assigned to variables, passed as arguments, returned by other functions, and have attributes like other objects. Read to enhance your ability to write more flexible and powerful code.
Pydantic - Simplifying Data Validation in Python: This tutorial introduces the library as a robust solution for data validation and settings management in Python, aimed at reducing boilerplate code and increasing code reliability. Read to learn how to apply Pydantic.
Locking mechanisms in python scripts: This article explores creating a locking mechanism in Python to prevent concurrent execution of specific commands in scripts used in CI/CD pipelines. Read to gain insights into implementing robust locking mechanisms in Python.
Shape typing in Python: Python has evolved to include advanced static types, enabling type-safe matrix multiplication by explicitly defining input and output shapes. Read this concise post to learn how to implement and benefit from type-safe matrix operations in Python.
🧠 Expert insight 📚
Here’s an excerpt from “Chapter 3: Managing the Human in the Loop” in the book, Active Machine Learning with Python by Margaux Masson-Forsythe published in March 2024.
Ensuring annotation quality and dataset balance
Maintaining high annotation quality and target class balance requires diligent management. In this section, we’ll look at some techniques that can help ensure labeling quality.
Assess annotator skills
It is highly recommended that annotators undergo thorough training sessions and complete qualification tests before they work independently. This ensures that they have a solid foundation of knowledge and understanding of their respective tasks. Annotators’ performance metrics can then be visualized in the labeling platform as reviewers accept or reject annotations. If a labeler has many rejected annotations, check that they understand the task and assess what help can be provided to them.
It is advisable to periodically assess the labeler’s skills by providing control samples for evaluation purposes. This ongoing evaluation helps maintain the quality and consistency of their work over time.
For example, designing datasets with known labels and asking the labelers to label these evaluation sets can be a good way to check if the task is well understood. Then, we can assess the accuracy of the annotations using a simple Python script.
First, we must define some dummy annotations that have been made by a labeler and some real annotations:
dummy_annotator_labels = ['positive', 'negative', 'positive',
'positive', 'positive']
dummy_known_labels = ['negative', 'negative', 'positive', 'positive',
'negative']
Then, we must calculate the accuracy and kappa score using scikit-learn’s metrics functions:
from sklearn.metrics import accuracy_score, cohen_kappa_score
accuracy = accuracy_score(dummy_annotator_labels, dummy_known_labels)
print(f"Annotator accuracy: {accuracy*100:.2f}%")
kappa = cohen_kappa_score(dummy_annotator_labels, dummy_known_labels)
print(f"Cohen's Kappa: {kappa:.3f}")
This returns the following output:
Annotator accuracy: 60.00%
Cohen's Kappa: 0.286
This technique is a simple way to implement a basic assessment of an annotator’s skills. Aim for an accuracy above 90% and a kappa score above 0.80; results below these thresholds warrant investigating the disagreements.
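Building on those thresholds, here is a minimal sketch (with hypothetical annotator names and labels, not from the book) that scores several annotators against the same control sample and flags anyone falling below the targets for follow-up:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Known labels for a control sample, plus each annotator's answers (hypothetical)
known = ['negative', 'negative', 'positive', 'positive', 'negative']
annotations = {
    "annotator_a": ['negative', 'negative', 'positive', 'positive', 'negative'],
    "annotator_b": ['positive', 'negative', 'positive', 'positive', 'positive'],
}

flagged = []
for name, labels in annotations.items():
    accuracy = accuracy_score(known, labels)
    kappa = cohen_kappa_score(known, labels)
    # Flag annotators below the quality targets for a follow-up review
    if accuracy < 0.9 or kappa < 0.8:
        flagged.append(name)

print(flagged)  # annotator_b falls below both thresholds
```

In a real labeling pipeline, the flagged list would feed back into retraining or mentoring for those annotators.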
Use multiple annotators
If your budget allows, you can assign each data point to multiple annotators to identify conflicts. These conflicts can then be resolved through consensus or by an expert reviewer.
For example, with sentiment analytics labeling, we have our dummy annotations for three labelers:
dummy_annotator_labels_1 = ['positive', 'negative', 'positive',
'positive', 'positive']
dummy_annotator_labels_2 = ['positive', 'negative', 'positive',
'negative', 'positive']
dummy_annotator_labels_3 = ['negative', 'negative', 'positive',
'positive', 'negative']
We can create a pandas DataFrame with the labels from the three labelers:
import pandas as pd
df = pd.DataFrame({
"Annotator1": dummy_annotator_labels_1,
"Annotator2": dummy_annotator_labels_2,
"Annotator3": dummy_annotator_labels_3
})
Then, we can take the majority vote as a real label:
df["MajorityVote"] = df.mode(axis=1)[0]
print(df["MajorityVote"])
This returns the following output:
0 positive
1 negative
2 positive
3 positive
4 positive
This method can be expensive because the labelers work on the same data, but it can ultimately result in more accurate annotations. Its feasibility depends on the priorities of the ML project, as well as the budget and organization of the labeling team. For instance, if the labeling team consists of junior labelers who are new to the field, this method may be a suitable choice.
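Majority voting resolves conflicts silently; in practice you may also want to surface the rows where annotators disagreed so an expert reviewer can check them. A minimal sketch, reusing the dummy labels above:

```python
import pandas as pd

df = pd.DataFrame({
    "Annotator1": ['positive', 'negative', 'positive', 'positive', 'positive'],
    "Annotator2": ['positive', 'negative', 'positive', 'negative', 'positive'],
    "Annotator3": ['negative', 'negative', 'positive', 'positive', 'negative'],
})

# A row has a conflict when the annotators did not all pick the same label
df["HasConflict"] = df.nunique(axis=1) > 1
conflicts = df[df["HasConflict"]]
print(conflicts.index.tolist())  # rows to route to an expert reviewer
```

Here rows 0, 3, and 4 show disagreement, matching the rows where the majority vote above was not unanimous.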
Balanced sampling
To prevent imbalanced datasets, we can actively sample minority classes at higher rates during data collection.
When collecting a dataset, it is important to monitor the distribution of labels across classes and adjust sampling rates accordingly. Without intervention, datasets often end up skewed toward majority classes due to their natural higher frequencies.
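Monitoring the class balance during collection can be as simple as recomputing normalized label counts as new batches arrive. A minimal sketch with made-up labels:

```python
import pandas as pd

# Hypothetical labels collected so far
labels = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

# Normalized counts reveal skew toward the majority class
distribution = labels.value_counts(normalize=True)
print(distribution)
# Class 0 holds only 20% of the data, a signal to boost its sampling rate
```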
Let’s look at some ways to actively sample minority classes at higher rates:
Employing active ML approaches such as uncertainty sampling can bias selection toward rare cases. Indeed, uncertainty sampling actively selects the data points that the current model is least certain about for labeling. These tend to be edge cases and rare examples, rather than the common cases the model has already seen many examples of. Since, by definition, minority classes occur less frequently, the model is naturally more uncertain about these classes. So, uncertainty sampling will tend to pick more examples from the under-represented classes for labeling to improve the model’s understanding.
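As a rough illustration (not from the book), uncertainty sampling for a binary classifier can be sketched by ranking unlabeled items by how close the model’s predicted probability is to 0.5; the document names and scores below are hypothetical:

```python
# Hypothetical model confidence scores for unlabeled items:
# P(class 1) for each document, e.g. from a classifier's predict_proba()
probs = {"doc_a": 0.97, "doc_b": 0.52, "doc_c": 0.05, "doc_d": 0.48, "doc_e": 0.70}

def margin(p):
    # Distance from a coin flip; a smaller margin means the model is less certain
    return abs(p - 0.5)

# Pick the k items the model is least certain about and send them for labeling
k = 2
to_label = sorted(probs, key=lambda d: margin(probs[d]))[:k]
print(to_label)  # the two items closest to 0.5
```

Since minority-class examples tend to sit near the decision boundary, this selection naturally pulls more of them into the labeling queue.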
Checking label distributions periodically during data collection is important. If minority classes are underrepresented, it is recommended to selectively sample more data points with those labels. This can be achieved by sampling the data from the unlabeled data pool using a pre-trained model that can identify the unrepresented classes. To ensure higher representation, the sampling strategy should be set to select specific classes with a higher ratio. For example, let’s reuse the imdb dataset from Hugging Face:
from datasets import load_dataset
dataset = load_dataset('imdb')
For testing purposes, we assume that the dataset is unlabeled and that the labels attached to it are from the model’s predictions. So, our goal is to sample the under-represented class. Let’s assume class 0 is under-represented and we want to over-sample it. First, we must take the training dataset as our dummy unlabeled data pool and convert it into a pandas DataFrame:
dummy_unlabeled_dataset_with_predictions_from_a_model = \
dataset['train']
df = pd.DataFrame(
dummy_unlabeled_dataset_with_predictions_from_a_model)
Next, we must get the number of data points for each label:
n_label_0 = df[df['label'] == 0].shape[0]
n_label_1 = df[df['label'] == 1].shape[0]
Now, we must calculate the number of samples to sample for each label, assuming we want to sample 1,000 samples and we want 80% of these samples to belong to class 0 and 20% to class 1:
nb_samples = 1000
n_sample_0 = int(0.8 * nb_samples)
n_sample_1 = int(0.2 * nb_samples)
sample_0 = df[df['label'] == 0].sample(n_sample_0, replace=False)
sample_1 = df[df['label'] == 1].sample(n_sample_1, replace=False)
# Concatenate the two samples into a single dataframe
sample_df = pd.concat([sample_0, sample_1], ignore_index=True)
# Print the sample dataframe
print(f"We have {len(sample_df['label'][sample_df['label']==0])} class 0 samples and {len(sample_df['label'][sample_df['label']==1])} class 1 samples")
This gives us the following output:
We have 800 class 0 samples and 200 class 1 samples
So, we sampled with the correct ratio and can, in theory, add these samples to our labeling queue next. By setting a higher sampling ratio for class 0 from the unlabeled data, we selectively oversample the minority class when getting new labeled data.
The key is closely tracking the evolving label distribution and steering sampling toward under-represented classes. This prevents highly imbalanced datasets that fail to provide sufficient examples for minority classes. The result is higher-quality, more balanced training data.
You can buy Active Machine Learning with Python by Margaux Masson-Forsythe here. Packt library subscribers can continue reading the chapter for free here.
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, leave a comment below!