PythonPro #61: Meta’s Llama Flaw, Codon’s NumPy Boost, and Web2Vec for Automated Website Analysis
In today’s Expert Insight we bring you an excerpt from the recently published book, Generative AI on Google Cloud with LangChain, which discusses how LLMs generate plausible but sometimes false responses (hallucinations), and demonstrates how structured prompting with LangChain can help mitigate the issue.
News Highlights: Meta's Llama flaw exposes AI servers to remote code execution via Python’s pickle; Codon’s 2025 update brings a faster, open-source NumPy with GPU support; Codegen, a Python library for code refactoring and analysis, is now open source.
My top 5 picks from today’s learning resources:
9 Statistical Distributions Every Data Scientist Should Know📊
Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python📈
And, in From the Cutting Edge, we introduce Web2Vec, a Python library that automates website analysis by extracting over 200 structured parameters through web crawling and direct feature extraction, with potential applications in cybersecurity, SEO, and machine learning.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Meta's Llama Framework Flaw Exposes AI Systems to Remote Code Execution Risks: The critical vulnerability exposed AI inference servers to remote code execution due to unsafe deserialization with Python's pickle module.
Codon in 2025: New compiler-optimized NumPy implementation. Switching to an open source license: This update leverages Codon's multithreading, GPU capabilities, and compiler optimizations, achieving significant speed improvements over standard NumPy.
Codegen is now open source: Codegen is a Python library for advanced code manipulation, enabling large-scale refactoring, pattern enforcement, and static analysis without requiring deep AST knowledge.
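The Llama flaw above hinges on a general property of pickle: deserializing untrusted bytes can execute arbitrary code, because an object's __reduce__ method lets the payload name any callable to run at load time. Here is a minimal, benign sketch of the mechanism (the payload only calls sorted, but an attacker would return something like (os.system, ("…",))):

```python
import pickle

class Payload:
    """Pretends to be harmless data, but controls its own deserialization."""
    def __reduce__(self):
        # pickle calls this callable with these args when the bytes are loaded.
        # A real exploit would return an attacker-chosen command instead.
        return (sorted, ([3, 1, 2],))

evil_bytes = pickle.dumps(Payload())

# A vulnerable server simply deserializes the attacker-supplied request body:
result = pickle.loads(evil_bytes)

# We pickled a Payload, yet the attacker's callable ran instead:
print(result)  # [1, 2, 3]
```

This is why the Python documentation warns never to unpickle data from untrusted sources; safer formats such as JSON do not carry executable payloads.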
💼Case Studies and Experiments🔬
How I Built a Python RL Trading Bot That Simulated 1150% Profit: Describes building a bot that identifies potential short squeezes using Financial Modeling Prep’s Fail-to-Deliver and historical stock price data.
An empirical study of developers’ challenges in implementing Workflows as Code: A case study on Apache Airflow: Analyzes 1,000 Stack Overflow posts, categorizing challenges into workflow definition, execution, environment setup, quality assurance, security, and optimization, and identifying key pain points.
📊Analysis
Observations: Using Python with DeepSeek-R1: Explores using DeepSeek-R1 for AI tasks, covering API integration, response streaming, and Retrieval-Augmented Generation (RAG) while analyzing its reasoning process.
Decorator JITs - Python as a DSL: Covers AST-based, bytecode-based, and tracing JITs, showing how they convert Python code into optimized LLVM IR for execution.
🎓 Tutorials and Guides 🤓
Security and cryptography algorithms: A guide: Covers block and stream ciphers, hashing, key exchange, public key encryption, and cryptographic libraries, including practical examples using Python.
TLS and networking: Explains TLS handshakes, encryption, certificate verification, networking layers, HTTP protocols, sockets, firewalls, and secure app deployment. Also discusses mutual TLS, DNS, and network security.
On Shared Birthdays (and a Bit on Pythagorean Triplets) • Pythonic Code: Demonstrates probability concepts with itertools, collections, and datetime, simulates birthday collisions, analyzes probability with brute-force, and more.
Make Sick Beats with Python: Explains how to build a simple drum machine in Python using the pygame library, covering setting up the environment, storing and playing sounds, representing music in code, and more.
Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python: Demonstrates building a dashboard which analyzes uploaded CSV data, generates Python scripts, executes them, and creates visualizations.
Nine Pico PIO Wats with MicroPython (Part 2): Demonstrates workarounds for PIO limitations through debugging techniques and real-world examples, while building a theremin-like musical instrument.
Managing Magento Configurations with PyGento: A Powerful Python CLI Tool for Developers: Explains how the tool integrates with PyGento, provides database access via SQLAlchemy, and automates tasks like searching, viewing, and editing Magento settings without using the admin panel.
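The birthday-collision simulation mentioned in the shared-birthdays piece above can be sketched in a few lines of standard-library Python. This is a generic Monte Carlo estimate, not the article's own code:

```python
import random

def birthday_collision_prob(n_people: int, trials: int = 20_000,
                            seed: int = 42) -> float:
    """Estimate the probability that at least two of n_people share a birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(n_people)]
        if len(set(birthdays)) < n_people:  # a duplicate means a collision
            hits += 1
    return hits / trials

# The classic result: with just 23 people the chance is already about 50%.
print(birthday_collision_prob(23))
```

The closed-form answer for 23 people is roughly 0.507, and the simulation converges on it as the trial count grows.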
🔑Best Practices and Advice🔏
AI Python Libraries: A centralized resource listing 1,037 libraries with descriptions and use cases for AI development, covering deep learning, machine learning, NLP, and data science.
Python Code for Automated Log Analysis & Alerting: Covers parsing system, firewall, and web server logs to detect suspicious activity like brute force attacks and malicious IPs.
9 Statistical Distributions Every Data Scientist Should Know: Introduces key statistical distributions, explains their characteristics, and provides practical examples.
Crafting a Winning Conference Talk: Lessons from a PyCon US Reviewer: Provides guidance based on the author's experience as a PyCon US reviewer, outlining common mistakes and recommendations for writing a strong conference talk proposal.
100 Common Python Mistakes: Covers logic bugs, inefficient code, non-Pythonic practices, and best practices for readability and performance, using clear examples for each mistake.
🔍From the Cutting Edge: Web2Vec — A Python Library for Website-to-Vector Transformation💥
In "Web2Vec: A Python Library for Website-to-Vector Transformation," D. Frąszczak and E. Frąszczak introduce Web2Vec, a Python library that converts websites into structured vector representations. The library automates feature extraction from web pages, integrating over 200 parameters from website content and structure to enable efficient analysis.
Context
Website processing involves extracting and transforming web data for analysis. This includes web crawling, which systematically navigates websites to collect data, and web scraping, which extracts specific information from web pages. Website feature extraction identifies key attributes such as structure, security settings, and external references, while vectorisation converts unstructured data into numerical formats for computational analysis.
The internet hosts over a billion websites, with millions actively generating data daily. Extracting insights is essential for market research, cybersecurity, and machine learning. While many research papers discuss web data collection, they often rely on custom scripts, leading to inefficiencies. Existing services like WHOIS, SimilarWeb, and Google Search Index provide valuable data but restrict free API access, requiring users to parse raw HTML instead. Web2Vec addresses these challenges with a unified, open-source solution for automated website data extraction and analysis.
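The vectorisation step described above — turning unstructured page data into a fixed-length numerical representation — can be illustrated with a toy sketch. The feature names here are hypothetical and far simpler than Web2Vec's 200+ parameters:

```python
import re

# A hypothetical feature schema: each function maps a raw URL + HTML to a number.
FEATURES = {
    "url_length":  lambda url, html: len(url),
    "uses_https":  lambda url, html: int(url.startswith("https://")),
    "num_links":   lambda url, html: len(re.findall(r"<a\s", html, re.I)),
    "num_scripts": lambda url, html: len(re.findall(r"<script", html, re.I)),
}

def website_to_vector(url: str, html: str) -> list[float]:
    """Apply every feature extractor in a fixed order to get one vector."""
    return [float(fn(url, html)) for fn in FEATURES.values()]

page = '<html><a href="/a">a</a><a href="/b">b</a><script></script></html>'
vec = website_to_vector("https://example.com", page)
print(vec)  # [19.0, 1.0, 2.0, 1.0]
```

Because every page maps to the same fixed-length vector, the output can feed directly into standard machine learning pipelines.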
Key Features
Web2Vec offers a structured approach to web data collection and analysis through:
Automated Website Crawling – Uses Scrapy-based spiders to extract data from single pages or entire websites.
Comprehensive Feature Extraction – Captures 211 parameters, including URL lexical features, HTML content, HTTP response details, SSL certificates, WHOIS data, and traffic metrics.
Flexible Data Processing – Supports active crawling and pre-generated HTML snapshots.
Cybersecurity Applications – Detects phishing sites, misinformation, and suspicious activity by integrating services like PhishTank and OpenPhish.
Graph-Based Analysis – Visualises website relationships through network graphs for deeper insights.
Open-Source & Extensible – Available on PyPI (pip install web2vec), with community support for updates and improvements.
What This Means for You
Web2Vec is a valuable tool for professionals and researchers working with web data. Data scientists can automate website feature extraction for large-scale analysis, while cybersecurity professionals can detect phishing and misinformation using structured data and threat intelligence services. SEO and marketing professionals can benefit from its ability to analyse rankings, metadata, and traffic sources. Developers and web scraping practitioners can replace custom scripts with Web2Vec’s unified framework, reducing redundancy and improving efficiency.
Examining the Details
Web2Vec processes websites using two primary methods: direct extraction from pre-generated HTML snapshots and active web crawling with Scrapy-based spiders. It extracts 211 parameters across multiple categories, converting website characteristics into structured vector representations. The extracted data, stored in JSON format, includes details such as URL, title, HTML response, and status codes. Users can either crawl entire websites using the Web2VecSpider or extract specific parameters through direct method invocation. Web2Vec also supports visualising website relationships using directed graphs, making it particularly useful for cybersecurity applications such as phishing detection and misinformation analysis.
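Based on the fields the paper names (URL, title, HTML response, status code), a single extracted record might look roughly like the sketch below. The exact key names are an assumption for illustration, not Web2Vec's documented schema:

```python
import json

# A hypothetical JSON record shaped after the fields described in the paper.
record_json = """
{
    "url": "https://example.com",
    "title": "Example Domain",
    "status_code": 200,
    "html": "<html><body>Example Domain</body></html>"
}
"""

record = json.loads(record_json)
print(record["title"], record["status_code"])  # Example Domain 200
```

Storing each crawled page as one such JSON object is what lets downstream tools consume the crawl results without re-parsing raw HTML.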
You can learn more by reading the entire paper or accessing the library on GitHub.
🧠 Expert insight💥
Here’s an excerpt from “Chapter 3: Grounding Responses” in the book, Generative AI on Google Cloud with LangChain by Leonid Kuligin, Jorge Zaldívar, and Maximilian Tschochohei.
Overcoming hallucinations in LLMs
LLMs are trained on large amounts of publicly available data (please take a look at Appendix 1 for more details on how LLMs are trained). By design, they absorb the information they are explicitly given in the prompt and the information they have seen during the
training. By default, LLMs don’t have access to any external information (except for what they have already memorized), and in most cases, they’re autoregressive models (they predict output words or tokens one by one by looking at the previous input, and this limits their reasoning capabilities). We’ll see some examples of how we can expand an LLM’s reasoning capabilities with agentic workflows in Chapters 9, 10, and 11.
Simply put, LLMs will use the information from their training to respond to a prompt. LLMs effectively reproduce human language and, therefore, their answers sound very credible even if they are just a probabilistic continuation of the prompt. This can be correct if the training corpus has the right information – or it can just be plausible, but factually wrong. If it’s the latter, the LLM is said to hallucinate.
Let’s take a simple example:
Prompt: Tell me about the black cucumbers.
Response:
Black Cucumbers
Black cucumbers, scientifically known as Cucumis sativus var. Nigra, are an intriguing variety of cucumber characterized by their unique dark green to black skin. They are not genetically modified but instead occur naturally due to a higher concentration of chlorophyll.
Origin and Cultivation
Black cucumbers are native to Europe and Asia, particularly Russia and China. They are grown in warm, moist climates and require well-drained soil rich in organic matter. The fruits typically mature within 60-70 days after planting.
Physical Characteristics
Skin: Deep green to black, with occasional lighter... (The response was truncated because it has reached the token limit. Try to increase the token limit if you need a longer response.)
At least to my knowledge, black cucumbers don’t exist. But the LLM has produced a response based on a description of actual cucumbers that looks plausible and real. In the summer of 2023, a US lawyer used an LLM to answer legal questions. The LLM cited non-existent cases, but they looked so trustworthy that the lawyer used them in court and got into trouble [1].
Sometimes hallucinations can be overcome by prompt engineering. Let’s look at the following prompt template:
from langchain.chains import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain_google_vertexai import VertexAI

llm = VertexAI(model_name="gemini-1.0-pro",
               temperature=0.8, max_output_tokens=128)

template = """Describe {plant}.
First, think whether {plant} exist.
If {plant} don't exist, answer "I don't have enough information about {plant}".
Otherwise, give their title, a short summary and then talk about origin and cultivation.
After that, describe their physical characteristics.
"""

prompt_template = PromptTemplate(
    input_variables=["plant"],
    template=template,
)

chain = LLMChain(llm=llm, prompt=prompt_template)
chain.run(plant="black cucumbers")
If we run this chain, we’ll get a relevant answer:
I don't have enough information about black cucumbers.
You can double-check and ask the question about green cucumbers to make sure that the LLM will give a correct answer with this prompt.
Hallucinations
Hallucinations are one of the key problems that the industry is facing at the moment.
The good news: There are ways to significantly reduce hallucination rates, and we’re going to discuss them in this and the next chapters.
The bad news: Any generative AI (GenAI) might produce hallucinations, and you need to evaluate and monitor them during application development. We’ll talk about evaluation in Chapter 14.
Generative AI on Google Cloud with LangChain was published in December 2024. Packt library subscribers can continue reading the entire book for free.
Get the eBook for $35.99 $24.99
And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!