PythonPro #52: AI-Powered Vulnhuntr for Python, SageMaker Core SDK, and Exploring User Behaviour with Python

Oct 24, 2024

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Building AI Applications with OpenAI APIs - Second Edition, which discusses how to create a language translation desktop app using OpenAI's ChatGPT API and Microsoft Word.

News Highlights: Protect AI to release Vulnhuntr, an AI tool for detecting Python zero-day vulnerabilities; Amazon launches SageMaker Core, a Python SDK simplifying machine learning with object-oriented interfaces; and PyCharm becomes the official IDE of OpenCV as JetBrains joins as a Silver Member.

Comprehensive Python Cheatsheet📚
Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset🚴‍♂️
Python's property(): Add Managed Attributes to Your Classes🔧
Python approach to the Semantic Web: exploring linked data and RDF🌐
Assert vs. Raise: When to Use Each in Your ML/AI Projects⚠️

And today’s Featured Study presents ChangeGuard, a tool designed to compare code behaviour before and after changes to detect functionality modifications.

Stay awesome!

Divya Anne Selvaraj

Editor-in-Chief

P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Open source LLM tool primed to sniff out Python zero-days: Researchers with Seattle-based Protect AI will soon release Vulnhuntr, an AI-powered open-source tool that uses Claude AI to detect zero-day vulnerabilities in Python codebases by analyzing entire call chains for security issues.
Introducing SageMaker Core: A new object-oriented Python SDK for Amazon SageMaker: The SDK will simplify the machine learning lifecycle by replacing complex JSON structures with object-oriented interfaces.
Press Release: PyCharm Becomes Official IDE of OpenCV, JetBrains Joins as Silver Member: As a Silver Member, JetBrains will financially support OpenCV, ensuring its resources remain free.

💼Case Studies and Experiments🔬

Part 2: Data Quality Dashboard: A Visual Approach to Monitoring Expectations in Databricks: Explains how to quickly identify issues using graphical representations like pie charts and bar charts.
Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset: Uses Python to uncover user behaviour patterns and develop strategies to convert casual riders into annual members.

📊Analysis

🎥Russell Keith-Magee on Beeware, packaging, GUI & money in Python: Focuses on the challenges of cross-platform Python packaging, particularly for desktop and mobile platforms, and discusses how BeeWare helps developers.
Should you use uv’s managed Python in production?: Advises careful consideration of uv’s production readiness, noting recent improvements but recommending thorough evaluation based on project-specific risks.

🎓 Tutorials and Guides 🤓

Python's property(): Add Managed Attributes to Your Classes: Covers creating read-only, read-write, and computed properties, logging, and more, while maintaining a stable public API for your classes.
A Multi-Agent AI Chatbot App using Databutton and Swarm: Explains how different agents can collaborate and hand off tasks, with an example of a multi-agent healthcare chatbot that connects users to specialized agents.
Understanding Pluggable Authentication Module (PAM) and Creating a Custom One in Python: Covers PAM’s architecture, module stacks, and control flags and walks you through building and integrating a custom PAM.
Python approach to the Semantic Web: exploring linked data and RDF: Covers creating RDF triples, querying SPARQL endpoints, and visualizing relationships using NetworkX.
Understanding Web Scraping in Python and Scrapy: Explains what web scraping is, its significance, and the tools required, such as BeautifulSoup, Requests, and Scrapy.
🎥A hand-holding guide to writing FUSE-based filesystems in Python: Covers the process of creating Python-based FUSE file systems, from basic functionality to more advanced features like file attributes.
Adding syntax to the cpython interpreter: Demonstrates how to add new syntax to Python, specifically making ternary statements default to None when no else condition is provided, similar to Ruby.

🔑Best Practices and Advice🔏

What I Learned from Making the Python Backend for YouTube Transcript Optimizer: Explains the process of building the Python backend for a YouTube Transcript Optimizer using FastAPI and SQLModel.
Comprehensive Python Cheatsheet: An extensive resource covering a wide array of Python topics, including syntax, data structures, and advanced concepts.
How to Use Lambda Functions in Python: Covers their syntax, common use cases with functions like map(), filter(), and sorted(), along with advantages, limitations, and best practices for effective use in simplifying code.
Assert vs. Raise: When to Use Each in Your ML/AI Projects: Discusses when to use assert for internal checks during development and raise for handling user-facing errors in ML/AI projects to ensure robust error handling.
Structural Pattern Matching in Python: Explores customizing pattern matching for classes, extracting nested data, and common limitations in Python’s implementation.
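
The assert-versus-raise guidance in the list above can be illustrated in a few lines. This is a minimal sketch rather than code from the linked article; the function name normalize_scores and its specific checks are hypothetical.

```python
def normalize_scores(scores):
    """Scale a list of non-negative scores so they sum to 1."""
    # Input validation: raise, because bad input can occur in production
    # and the check must survive `python -O`, which strips assert statements.
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")

    total = sum(scores)
    if total == 0:
        raise ValueError("scores must not all be zero")

    result = [s / total for s in scores]

    # Internal sanity check: assert, because a failure here would mean a bug
    # in this function rather than a problem with the caller's input.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized scores should sum to 1"
    return result
```

Calling normalize_scores([]) raises ValueError regardless of interpreter flags, while the final assert only runs when assertions are enabled.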

🔍Featured Study: ChangeGuard - Validating Code Changes via Pairwise Learning-Guided Execution💥

In "ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution," Gröninger et al. present a tool called ChangeGuard, which compares code behaviour before and after changes to determine whether the modifications alter functionality.

Context

Validating whether code changes preserve intended behaviour is a key challenge in software development, particularly when changes are deep within complex projects. Developers may modify code to improve readability or performance, or to fix bugs, but unintended changes in functionality can lead to errors. Current methods, such as regression testing, often fail to catch these subtle changes. This study is relevant because it introduces ChangeGuard, a more reliable approach that uses pairwise learning-guided execution. This approach involves running two versions of a code snippet simultaneously and predicting values to ensure the code runs correctly, even in complex scenarios.

Key Features of ChangeGuard

Pairwise learning-guided execution: Simultaneously executes old and new versions of code to compare their runtime behaviour.
Value injection: Predicts and injects missing or uninitialised values, ensuring the code executes smoothly and reaches all relevant paths.
High precision and recall: Achieves 77.1% precision and 69.5% recall in identifying behaviour-altering code changes.
Extensive evaluation: Tested on 224 manually annotated code changes and datasets generated by automated refactoring tools.
Outperforms regression tests: Traditional regression tests only achieved 7.6% recall in identifying semantics-changing code modifications.
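
To make the idea of pairwise execution concrete, here is a toy sketch that runs an old and a new version of a function on the same inputs and reports where their observable behaviour differs. It is not ChangeGuard's implementation; the functions old_mean, new_mean, observe, and compare_versions are hypothetical names chosen for illustration, and the inputs are supplied by hand rather than predicted.

```python
def old_mean(values):
    """Original version: average of a list, 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(values) / len(values)


def new_mean(values):
    """'Refactored' version that silently drops the empty-list guard."""
    return sum(values) / len(values)


def observe(func, args):
    """Run func and record either its return value or the exception type."""
    try:
        return ("returned", func(*args))
    except Exception as exc:  # record the failure instead of propagating it
        return ("raised", type(exc).__name__)


def compare_versions(old, new, inputs):
    """Return the inputs on which the two versions behave differently."""
    differences = []
    for args in inputs:
        before, after = observe(old, args), observe(new, args)
        if before != after:
            differences.append((args, before, after))
    return differences


differences = compare_versions(old_mean, new_mean, [([1, 2, 3],), ([],)])
for args, before, after in differences:
    print(f"input={args}: old {before}, new {after}")
```

Only the empty-list input is reported, because that is the one case where the refactored version raises instead of returning a value.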

What This Means for You

This paper will be most useful for software developers, especially those working with large and complex codebases. It provides practical insights into validating code changes more effectively than existing methods, offering a way to catch unintended behaviour early in the development process. Developers using automated refactoring tools or large language models like GPT-4 will particularly benefit from ChangeGuard's ability to detect subtle, behaviour-altering modifications.

Examining the Details

ChangeGuard's methodology is based on pairwise learning-guided execution, an extension of an existing technique. It predicts missing values dynamically, ensuring more execution paths are covered than previous approaches. The tool was evaluated on 224 annotated code changes from popular Python open-source projects, showing high accuracy in detecting semantics changes. Additionally, ChangeGuard was applied to automated refactoring tools and large language models like GPT-3.5 and GPT-4, where it found 87 out of 187 and 143 out of 258 code changes to unexpectedly alter behaviour. This comprehensive testing provides strong evidence for ChangeGuard's reliability and robustness.
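
The value-injection idea described above can be approximated in a few lines of plain Python. The sketch below is a deliberate simplification: where the paper describes predicting missing values, this version injects one fixed placeholder for every undefined name. The function run_with_injection and its parameters are hypothetical.

```python
import re


def run_with_injection(source, default=0, max_injections=10):
    """Execute source, injecting `default` for each undefined name.

    Returns the final namespace and a dict of the injected names.
    """
    namespace = {}
    injected = {}
    for _ in range(max_injections + 1):
        try:
            exec(source, namespace)
            return namespace, injected
        except NameError as exc:
            # Extract the missing identifier from the error message.
            match = re.search(r"name '(\w+)' is not defined", str(exc))
            if match is None or match.group(1) in injected:
                raise
            missing = match.group(1)
            # Placeholder value; a learned model would predict one instead.
            namespace[missing] = injected[missing] = default
    raise RuntimeError("too many undefined names")


snippet = "total = price * quantity + shipping"
namespace, injected = run_with_injection(snippet, default=1)
print(sorted(injected), namespace["total"])
```

Because the snippet is re-executed from the start after each injection, this toy version is only suitable for side-effect-free code.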

You can learn more by reading the entire paper and accessing ChangeGuard.

🧠 Expert insight💥

Building AI Applications with OpenAI APIs - Second Edition

Here’s an excerpt from “Chapter 6: Language Translation Desktop App with the ChatGPT API and Microsoft Word” in the book, Building AI Applications with OpenAI APIs - Second Edition by Martin Yanev, published in October 2024.

Integrating the ChatGPT API with Microsoft Office

In this section, we will explore how to set up our project and install the docx Python library to extract text from Word documents. The docx library is a Python package that allows us to read and write Microsoft Word (.docx) files and provides a convenient interface to access information stored in these files.

The first step is to initiate your work by creating a new directory called Translation App and loading it with VSCode. This will enable you to have a dedicated area to craft and systematize your translation app code. Activate your virtual environment from the terminal window following the steps outlined in Chapter 1, Getting Started with the ChatGPT API for NLP Tasks.

To run the language translation desktop app, you will need to install the following libraries:

openai: The openai library allows you to interact with the OpenAI API and perform various NLP tasks
docx: The docx library (installed as python-docx and imported as docx) allows you to read and write Microsoft Word .docx files using Python
tkinter: The tkinter library is a built-in Python library that allows you to create Graphical User Interfaces (GUIs) for your desktop app

As tkinter is a built-in library, there is no need for installation since it already exists within your Python environment. To install the openai and docx libraries, access the VSCode terminal, and then execute the following commands:

pip install openai
pip install python-docx
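
After installation, a quick way to confirm that all three libraries are visible to your interpreter is to look them up without importing them. The following is a minimal sketch using only the standard library; the helper name check_modules is an assumption made for illustration.

```python
from importlib.util import find_spec

# Import names, not pip package names: python-docx is imported as "docx".
REQUIRED_MODULES = ("openai", "docx", "tkinter")


def check_modules(names=REQUIRED_MODULES):
    """Map each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_modules().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If any line reports "missing", check that the virtual environment you activated is the same one in which you ran pip.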

To access and read the contents of a Word document, you will need to create a sample Word file inside your project. Here are the steps to create a new Word file:

In your project, right-click on the project directory, select New Folder, and name it files.
Right-click on the files folder and select New File.
In the edit field that appears, enter a filename with the .docx extension – for example, info.docx.
Press the Enter key to create the file.
Once the file is created, open it using Microsoft Word.

You can now add some text or content to this file, which we will later access and read using the docx library in Python. For this example, we have created an article about New York City. You can find the complete article here: https://en.wikipedia.org/wiki/New_York_City. However, you can choose any Word document containing text that you want to analyze:

The United States’ most populous city, often referred to as New York City or NYC, is New York. In 2020, its population reached 8,804,190 people across 300.46 square miles, making it the most densely populated major city in the country and over two times more populous than the nation’s second-largest city, Los Angeles. The city’s population also exceeds that of 38 individual U.S. states. Situated at the southern end of New York State, New York City serves as the Northeast megalopolis and New York metropolitan area’s geographic and demographic center - the largest metropolitan area in the country by both urban area and population. Over 58 million people also live within 250 miles of the city. A significant influencer on commerce, health care and life sciences, research, technology, education, politics, tourism, dining, art, fashion, and sports, New York City is a global cultural, financial, entertainment, and media hub. It houses the headquarters of the United Nations, making it a significant center for international diplomacy, and is often referred to as the world’s capital.

Now that you have created the Word file inside your project, you can move on to the next step, which is to create a new Python file called app.py inside the Translation App root directory. This file will contain the code to read and manipulate the contents of the Word file using the docx library. With the Word file and the Python file in place, you are ready to start writing the code to extract data from the document and use it in your application.

To test whether we can read Word files with the python-docx library, we can implement the following code in our app.py file:

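import docx

# Load the Word document; replace the placeholder with your file's path
doc = docx.Document("<full_path_to_docx_file>")

# Concatenate the text of every paragraph into a single string
text = ""
for para in doc.paragraphs:
    text += para.text

print(text)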

Make sure to replace <full_path_to_docx_file> with the actual path to your Word document file. To obtain it, right-click on your .docx file in VSCode and select Copy Path for the full path, or Copy Relative Path if you run app.py from the project root.
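
A relative path is resolved against the directory the script is launched from, so it can break if you run app.py from somewhere else. One way to guard against that is to convert it to an absolute path first. The sketch below assumes the files/info.docx location used in this walkthrough; the helper name resolve_docx_path is hypothetical.

```python
from pathlib import Path


def resolve_docx_path(relative_path, base_dir=None):
    """Turn a project-relative path into an absolute one.

    base_dir defaults to the current working directory; pass the folder
    that contains app.py to make the result independent of where the
    script is launched from.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / relative_path).resolve()


# "files/info.docx" is the sample location used in this walkthrough.
docx_path = resolve_docx_path("files/info.docx")
print(docx_path, "exists" if docx_path.exists() else "not found")
```

Inside app.py you could pass Path(__file__).parent as base_dir and then hand str(docx_path) to docx.Document.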

Once you have done that, run the app.py file and verify the output. This code will read the contents of your Word document and print them to the console. If the text extraction works correctly, you should see the text of your document printed in the console (see Figure 6.1). The text variable now holds the data from info.docx as a Python string.
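
Note that the loop above appends each paragraph's text directly, so paragraph boundaries disappear in a multi-paragraph document. If you want to keep them, one option is to join the paragraphs with a separator. This is a sketch, not the book's code: the helper join_paragraphs is hypothetical, and stand-in objects with a .text attribute are used so that it runs without a Word file.

```python
from types import SimpleNamespace


def join_paragraphs(paragraphs, separator="\n"):
    """Join the .text of each non-empty paragraph with a separator."""
    return separator.join(p.text for p in paragraphs if p.text.strip())


# Stand-in objects with a .text attribute, used here so the sketch runs
# without a Word file; in app.py you would pass doc.paragraphs instead.
sample = [
    SimpleNamespace(text="First paragraph."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second paragraph."),
]
print(join_paragraphs(sample))
```

Skipping empty paragraphs avoids runs of blank lines, which Word documents often contain as spacing.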

Figure 6.1 – Word text extraction console output

Packt library subscribers can continue reading the entire book for free. You can buy Building AI Applications with OpenAI APIs - Second Edition here.

Get the eBook for ~~$31.99~~ $21.99!

Get the Print Book for $39.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or leave a comment below!