Data Engineering with GenAI Book, The Great Docs Project | Issue 88
A weekly curated update on data science and engineering topics and resources.
This week's agenda:
Open Source of the Week - The Great Docs project
New learning resources - Stanford Frontier Systems course, Gemma 4 deep dive, running LLMs locally
Book of the week - Data Engineering with Generative and Agentic AI on AWS by Justin J. Leto
The newsletter is also available on LinkedIn and Medium.
Are you interested in learning about SQL AI agents in production? If so, please check out my LinkedIn Learning course:
Open Source of the Week
This week’s focus is on the Great Docs project. Great Docs is an open source project from Posit that automatically generates modern, production-ready documentation websites for Python packages. Rather than requiring authors to manually build documentation pages, configure themes, and maintain site structure, Great Docs inspects a package, discovers its public API, and generates a complete documentation site with references, guides, and deployment workflows. Built on top of Quarto, it aims to reduce documentation overhead while producing polished sites that work well for both human readers and AI-assisted workflows.
Project repo: https://github.com/posit-dev/great-docs
Key Features
Automatically discovers and documents package APIs, including classes, functions, exceptions, and type definitions
Detects common docstring formats such as NumPy, Google, and Sphinx styles and generates structured reference pages automatically
Built on Quarto, enabling rich publishing features including code execution, cross-references, and flexible layouts
Includes modern UI features such as dark mode, search, responsive layouts, keyboard shortcuts, and GitHub integration
Generates LLM-friendly outputs (llms.txt, Markdown pages, and agent skills) to improve compatibility with AI assistants and coding agents
Creates source code links with line references, making it easy to navigate from documentation back to implementation
Supports one-command deployment to GitHub Pages with automatically generated CI/CD workflows
Includes documentation quality tools for checking missing references and maintaining documentation consistency
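To illustrate the first two features, here is a small made-up function with a NumPy-style docstring, the kind of structured docstring that tools like Great Docs can detect and render as a reference page (the function itself is an example of mine, not part of the project):

```python
def moving_average(values, window):
    """Compute the simple moving average of a sequence.

    Parameters
    ----------
    values : list of float
        The input series.
    window : int
        Number of consecutive points to average over.

    Returns
    -------
    list of float
        One average per full window, so the result has
        ``len(values) - window + 1`` entries.
    """
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Because the Parameters and Returns sections follow a recognized convention, a documentation generator can parse them into typed, cross-referenced tables rather than rendering the docstring as a flat block of text.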
More details are available in the project documentation.
License: MIT
New Learning Resources
Here are some new learning resources that I came across this week.
Stanford Frontier Systems
The following Stanford course hosts weekly sessions with global leaders working on the biggest bottlenecks in frontier technology progress.
Gemma 4 Deep Dive
The following talk by Cassidy Hardin, a researcher at Google DeepMind, focuses on the Gemma 4 models, Google’s recent family of open LLMs.
Run AI Locally on Mac
The following video by Tech-Practice focuses on an approach for running AI locally on Apple Silicon using OpenCode and MLX.
Running Claude Code with Qwen 3.6
The following tutorial by James Layne provides a step-by-step guide for setting up Claude Code with local AI models, using Qwen 3.6 as an example.
Book of the Week
This week’s focus is on a new data engineering book - Data Engineering with Generative and Agentic AI on AWS: Building an AI-Augmented Data Practice for the Enterprise by Justin J. Leto. The book, as the name implies, focuses on applying generative AI and agentic frameworks to data engineering workloads on AWS. Instead of treating AI as a separate capability layered onto existing pipelines, it demonstrates how AI can become part of the entire data lifecycle - from ingestion and transformation to retrieval, analytics, and business intelligence. Through practical examples built on AWS services, the book provides a roadmap for designing intelligent data platforms that combine modern data architectures with AI-driven automation and decision support.
Topics Covered
Foundations of modern data engineering — core principles, workflows, and the evolving role of data engineers in an AI-augmented ecosystem
Data security and governance — strategies for managing access control, compliance, and trustworthy data systems
Modern data architectures — designing and implementing data lakes, data mesh, and data marts, and understanding where each architecture fits best
Big data processing and transformation — building scalable processing pipelines with AWS Glue and AI-assisted workflows
Pipeline orchestration and observability — monitoring, automating, and maintaining reliable data systems
Multimodal data enrichment — extracting insights from text, documents, images, and other unstructured data using AI services
Retrieval-Augmented Generation (RAG) — building semantic search systems with vector databases and enterprise knowledge retrieval
Streaming and real-time analytics — enriching live data streams with generative AI capabilities
Generative analytics and business intelligence — enabling natural language interfaces, text-to-SQL workflows, and AI-powered dashboards
Building autonomous agents — designing AI agents using frameworks, orchestration tools, and Model Context Protocol (MCP) concepts
Evaluating emerging AI technologies — frameworks for distinguishing practical business value from hype and identifying future opportunities
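The RAG topic above centers on semantic search: embedding documents as vectors and retrieving the ones closest to a query. As a minimal, library-free sketch of just the retrieval step (the vectors below are toy hand-made stand-ins for real model embeddings, and the function names are illustrative, not from the book):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, top_k=1):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy "embeddings": in practice these come from an embedding model,
# and a vector database handles storage and nearest-neighbor search.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]
print(retrieve(query, docs, top_k=2))  # nearest two documents by cosine similarity
```

In a production RAG system, the brute-force scan here is replaced by an approximate nearest-neighbor index, and the retrieved passages are passed to the LLM as grounding context for its answer.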
This book is ideal for data engineers, analytics engineers, architects, technical leaders, and AI practitioners who want to understand how to integrate generative and agentic AI into enterprise data platforms. It is especially useful for readers interested in AWS-based solutions and in preparing for the next evolution of data engineering workflows.
The book is available for purchase on the publisher’s website and on Amazon.
Have any questions? Please comment below!
See you next Saturday!
Thanks,
Rami