DiffusionGemma, Column-Level Data Lineage Engine, LLMs: The Hard Parts | Issue 93
A weekly curated update on data science and engineering topics and resources.
This week’s agenda:
Open Source of the Week - DiffusionGemma
New learning resources - Gemma 4 12B MTP local test, Column-Level Data Lineage Engine From Scratch, DuoBench planner/implementer LLM pair benchmark
Book of the week - Large Language Models: The Hard Parts: Open Source AI Solutions for Common Pitfalls by Thársis T. P. Souza and Jonathan K. Regenstein Jr.
The newsletter is also available on LinkedIn and Medium.
Are you interested in learning about SQL AI agents in production? If so, please check out my LinkedIn Learning course:
Open Source of the Week
This week’s focus is on the DiffusionGemma project. DiffusionGemma is an open-source experimental model from Google that explores text diffusion as an alternative to autoregressive token-by-token generation. Rather than generating one token at a time from left to right, DiffusionGemma drafts an entire 256-token block in parallel through iterative refinement — starting from a canvas of random placeholder tokens and progressively locking in correct ones — delivering up to 4x faster inference on dedicated GPUs (1000+ tokens/sec on an NVIDIA H100, 700+ on an RTX 5090). The model is a 26B Mixture of Experts built on the Gemma 4 family and Gemini Diffusion research, activating only 3.8B parameters per inference step and fitting within 18GB of VRAM when quantized. It targets local, low-concurrency, latency-sensitive workflows such as in-line editing, code infilling, and rapid iteration, where bi-directional attention and parallel decoding give it an edge over standard autoregressive Gemma 4. The trade-off is output quality below standard Gemma 4 and diminishing returns in high-QPS cloud serving, where batching already saturates compute.
Project repo: https://huggingface.co/google/diffusiongemma-26B-A4B-it
Key Features
Parallel block decoding — generates 256 tokens per forward pass instead of one, shifting the decode bottleneck from memory bandwidth to compute and saturating local GPU utilization.
Bi-directional attention — every token attends to all others in the block, which helps non-linear domains such as code infilling, mathematical graphs, and amino acid sequences.
Iterative self-correction — the model refines its own output across multiple passes, evaluating the entire block to fix mistakes before convergence.
MoE efficiency — 26B total parameters with only 3.8B active during inference, fitting in 18GB VRAM when quantized for consumer GPUs (RTX 5090, RTX 4090).
Multi-runtime support — official integrations with MLX, vLLM (with Red Hat support), and Hugging Face Transformers, with
llama.cppsupport arriving soon.Fine-tuning paths — a
Hackable DiffusionJAX toolbox from Google plus tutorials with Unsloth and NVIDIA NeMo for adapting the model to non-linear tasks like Sudoku and code editing.NVIDIA NVFP4 acceleration — native 4-bit floating-point support across Hopper, Blackwell, RTX PRO, DGX Spark, and DGX Station for near-lossless throughput gains.
Cloud and local deployment — runs on consumer GPUs, NVIDIA NIM, and the Gemini Enterprise Agent Platform Model Garden.
More details are available in the DiffusionGemma developer guide.
License: Apache 2.0
New Learning Resources
Here are some new learning resources that I came across this week.
Gemma 4 12B MTP Local Test with llama.cpp
This new video from Venelin Valkov puts the dense Gemma 4 12B model from Google DeepMind through a local test on llama.cpp, with the Multi-Token Prediction (MTP) setup enabled to measure how much faster inference becomes. It walks through hands-on benchmarks on coding, OCR, and visual RAG tasks, and runs a head-to-head against the 26B Gemma 4 MoE and Qwen3.6.
Column-Level Data Lineage Engine From Scratch
This new tutorial from George Yates (The Data and AI Guy) walks through building a column-level data lineage engine in roughly 1,000 lines of Python by parsing real SQL syntax trees instead of regex. It covers tracing dependencies across a 25-model warehouse, mapping a column’s exact blast radius, finding the root cause of final mart metrics, and wiring a CI gate that blocks breaking changes before they merge.
DuoBench: Benchmarking Planner / Implementer LLM Pairs
This new tutorial from Alejandro AO walks through DuoBench, a small Skill-shaped benchmark harness that runs planner → implementer combinations across Kimi K2.7, Kimi K2.6, GPT-5.5, and Claude Opus 4.8 on a recent CPython issue, then scores each commit against token cost. The headline finding: planning is cheap and implementation is where the bill grows, with Kimi K2.7 landing at the high-quality, low-cost corner of the chart both as a solo model and as the implementer behind a closed SOTA planner.
Build a Self-Healing CI/CD Pipeline with AI
This new tutorial from freeCodeCamp walks through wiring n8n, OpenAI, and GitHub Actions into a workflow that automatically detects, analyzes, and resolves CI/CD pipeline failures. It covers the workflow logic and architecture, setting up the local environment with Git and GitHub, building a Node.js / Express demo app, creating a smoke test script, implementing the GitHub Actions pipeline, and setting up the n8n automation account.
Hermes Agent Architecture Walkthrough
This new tutorial from Alejandro AO walks through the architecture of Hermes, his always-on AI agent. It covers a bird’s-eye view of the system, the agent loop, context construction and compression (including the compression prompt itself), gateway integrations with external services like Telegram and Slack, SQLite-backed memory, and scheduled cron jobs.
GitHub for Beginners
This free 16-video playlist from the GitHub team walks through the basics of Git and GitHub for new developers. It covers a brief introduction to Git, beginner Git commands with examples, creating a first repository, uploading files and adding code, opening and merging pull requests, GitHub Issues and Projects, GitHub Actions, GitHub security, GitHub Pages, Markdown, open source contributions, and using Git and GitHub inside VS Code.
Parallel Claude Code Agents with Git Worktrees
This new tutorial by Brianna Nicole walks through git worktrees as the missing piece for running multiple Claude Code sessions in parallel. It explains why two Claude Code sessions in one directory break, then covers the manual git worktree add / list / remove workflow, Claude Code’s built-in --worktree flag, the --tmux variant, .worktreeinclude for copying .env files, and five gotchas around shared databases, migrations, dependencies, and cleanup — with a live demo running two agents on the same Solana repo.
Book of the Week
This week’s focus is on a new book — Large Language Models: The Hard Parts: Open Source AI Solutions for Common Pitfalls by Thársis Souza, PhD and Jonathan Regenstein. The book is a practical examination of the technical challenges that arise when a working LLM moves from a demo to a real application. Instead of focusing on capabilities, it focuses on implementation pitfalls — evaluation, input management, testing, and safety — and pairs each chapter with reproducible Python code and open-source tooling. The goal is a grounded, builder-oriented view of LLM constraints and trade-offs, not another tour of the model frontier.
Topics Covered
Evaluation for nondeterministic systems — designing testing and evaluation strategies for systems whose outputs change run-to-run
Context management and RAG — handling input length, retrieval pipelines, and long-context retrieval in production
Output inconsistency and structural unreliability — addressing variability and broken structure in LLM outputs
Safety and content moderation — implementing guardrails and moderation frameworks around generated content
Alignment challenges and mitigations — exploring techniques for steering model behavior toward intended outcomes
Local open-source deployment — running and integrating open-source models locally rather than relying on hosted APIs
Reproducible Python implementations — every solution is paired with code and open-source tools the reader can run
Builder-oriented framing — written for engineers and technical product leads who ship LLM applications, not for researchers
This book is ideal for engineers, AI practitioners, and technical product leads who want a practical guide to deploying LLM-based applications and managing the trade-offs that surface beyond the prototype stage.
The book is available for purchase on the publisher’s website and on Amazon.
Have any questions? Please comment below!
See you next Saturday!
Thanks,
Rami
Thanks for reading Rami’s Data Newsletter! Subscribe for free to receive new posts and support my work.
Disclaimer: I use AI to help edit the content in this newsletter. If you spot a mistake, please leave a comment — I’d appreciate the heads-up.




