Statistical Analysis with Python, SQLGlot & MIT Deep Learning | Issue 71
A weekly curated update on data science and engineering topics and resources.
This week's agenda:
Open Source of the Week - The SQLGlot project
New learning resources - Setting up memory for an AI application, MIT deep learning course, Statistical Rethinking 2026, introduction to AI engineering, LlamaIndex crash course
Book of the week - Statistical Analysis with Python For Dummies by Joseph Schmuller
The newsletter is also available on LinkedIn and Medium.
Enjoying this newsletter? Here’s how you can support:
Click 👍 and share ♻️
Have a Medium subscription? Please read it on Medium!
Are you interested in learning how to set up automated workflows with GitHub Actions? If so, please check out my course on LinkedIn Learning:
Open Source of the Week
This week’s focus is on the SQLGlot project - A powerful, no-dependency SQL parser, transpiler, optimizer, and execution engine written in Python.
GitHub: https://github.com/tobymao/sqlglot
🧠 What is SQLGlot?
SQLGlot is a versatile library for reading, manipulating, formatting, and translating SQL across many dialects. It ingests SQL text, generates an abstract syntax tree (AST), and produces syntactically correct SQL for a target dialect such as DuckDB, Spark, Snowflake, BigQuery, Postgres, MySQL, and many others. It also includes features for query analysis, syntax error detection, and basic execution.
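For instance, the full round trip from SQL text to AST and back takes just a couple of lines (a minimal sketch; the exact rendered output may vary by SQLGlot version):
import sqlglot

# Parse MySQL into an AST, then render the same query for Postgres
ast = sqlglot.parse_one("SELECT IFNULL(a, 0) FROM t", read="mysql")
print(ast.sql(dialect="postgres"))  # IFNULL is typically rendered as COALESCE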
🚀 Why It’s Useful
Dialect Translation: Simplifies migrating SQL across engines by automatically transpiling queries.
Parser & Formatter: Can parse diverse SQL inputs, ensure correct syntax, and produce clean, standardized output.
AST-Level Manipulation: Lets developers programmatically inspect and modify query structures, enabling tooling such as linters, analyzers, and automated refactors.
Optimizer & Analyzer: Includes basic optimization rules and expression tree traversal utilities.
Execution Engine: Although not designed for performance, SQLGlot can execute queries against Python data structures, which is useful for testing or in-memory analytics (see the short example after this list).
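As a quick illustration of the execution engine (a minimal sketch; the orders table here is made-up sample data):
from sqlglot.executor import execute

# Run SQL directly against plain Python dictionaries
tables = {"orders": [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]}
result = execute("SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders", tables=tables)
print(result)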
🛠️ Key Features
Transpile between 30+ SQL dialects (including Postgres, Spark, Hive, Snowflake, BigQuery, SQLite, TSQL, and more).
Detailed parser errors & warnings with context.
Extensible dialect system, so you can plug in support for custom SQL syntaxes.
AST diffing, helping compare and transform queries semantically (see the example after this list).
Benchmarks show performance that’s competitive with other Python SQL parsers.
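For example, computing a semantic diff between two queries (a minimal sketch):
from sqlglot import diff, parse_one

# Produce an edit script describing how one query AST differs from another
edits = diff(parse_one("SELECT a + b FROM t"), parse_one("SELECT a - b FROM t"))
for edit in edits:
    print(edit)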
🧪 How to Get Started
Install from PyPI:
pip install "sqlglot[rs]"
Basic transpilation example:
import sqlglot
# Transpile a date/time expression from DuckDB SQL to Hive SQL
sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")
AST traversal example:
from sqlglot import parse_one, exp

# Walk the AST and print every column reference in the query
node = parse_one("SELECT a, b + 1 AS sum FROM my_table")
for col in node.find_all(exp.Column):
    print(col.alias_or_name)
Syntax error handling:
import sqlglot

try:
    # The unbalanced parenthesis below triggers a ParseError
    sqlglot.transpile("SELECT foo FROM (SELECT baz FROM t")
except sqlglot.errors.ParseError as e:
    print("Syntax error:", e)
💡 Who Is It For
Data engineers migrating workloads across SQL engines.
Tool builders creating IDE features such as SQL formatters, analyzers, or linters.
Developers working with multiple database dialects and needing a common parser interface.
📌 Why It Matters
SQL remains the lingua franca of analytics and data engineering, but dialect differences (e.g., between Snowflake, Spark, Postgres, and BigQuery) create friction in cross-platform tooling and migrations. SQLGlot lowers that barrier by providing a single, extensible engine for parsing and translating SQL, helping teams maintain compatibility and consistency across systems.
More details are available in the project documentation.
License: MIT
New Learning Resources
Here are some new learning resources that I came across this week.
Setting Up a Memory for an AI Application from Scratch
By default, LLMs don’t come with memory; each conversation is independent of the previous one. So, how do you add memory to an AI application? Apparently, it is simpler than it sounds, and it takes less than five lines of code: you inject the previous conversation into the prompt.
The following tutorial explains how to build a memory from scratch into an AI application and outlines its limitations.
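In the spirit of the tutorial, here is a minimal sketch of the idea (call_llm is a hypothetical placeholder for your model client, not a real API):
# Keep the conversation history and inject it into every prompt
history = []

def ask(user_message):
    prompt = "\n".join(history + [f"User: {user_message}", "Assistant:"])
    reply = call_llm(prompt)  # call_llm is a hypothetical stand-in for your LLM client
    history.append(f"User: {user_message}")
    history.append(f"Assistant: {reply}")
    return reply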
Statistical Rethinking 2026
Statistical Rethinking is one of the great resources for getting started with Bayesian statistics. The course runs every winter (I believe this is the 6th year), starting in January 2026. It focuses on Bayesian methods and approaches to making inferences from data, with examples in R and Stan. Code in other languages is also available, including different flavors of R with the Tidyverse, ggplot2, and brms; Python with PyMC3; and Julia with Turing.
Introduction to Deep Learning - New MIT Course
MIT released a new course (recorded in Spring 2024) this week that focuses on deep learning. This hands-on course by Prof. Rama Ramakrishnan is a fast-paced introduction to deep learning, emphasizing how these models work and how to leverage them to solve complex problems. It covers the following topics:
Training deep neural networks
Applications for computer vision
Building convolutional neural networks from scratch
Applications for natural language
Embeddings and transformers
LLMs and RAG
Fine-tuning LLMs
Introduction to AI Engineering
The following workshop by Louis-François Bouchard provides an introduction to the field of AI engineering. It covers topics such as the AI engineering workflow, agents, and multi-agent systems.
Introduction to LlamaIndex
An introduction to LlamaIndex, an AI agent and RAG framework for Python.
Explaining Black-Box Transformers
A fascinating video that opens the black box of transformer models and explains at which stage of training they shift from memorizing inputs to generalizing. A great illustration.
Book of the Week
This week’s focus is on a new stats book - Statistical Analysis with Python For Dummies by Joseph Schmuller. The book provides an introduction to statistical analysis with Python, from basic descriptive statistics and visualization through hypothesis testing and regression.
The book covers the following topics:
Getting started with Python for stats — setting up your Python environment and learning the core libraries and tools you’ll use for analysis.
Describing and visualizing data — methods to summarize and plot data to explore patterns and distributions.
Measuring central tendency and variation — understanding mean, median, standard deviation, and other key descriptive measures in practice.
Hypothesis testing fundamentals — concepts like sampling distributions, t-tests, ANOVA, and chi-square tests, with Python examples for implementation (see the short sketch after this list).
Regression and correlation — linear and multiple regression models, interpretation of coefficients, and analysis of relationships between variables.
Probability and modeling basics — foundational ideas in probability and using models such as logistic regression to analyze data.
Practical tools and tips — resources for Python users and guidance on applying statistical results to support data-driven decisions.
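To give a flavor of the hypothesis-testing material (a minimal sketch using SciPy with made-up data, not code from the book):
from scipy import stats

# Hypothetical measurements from two groups
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4]

# Two-sample t-test for a difference in means
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")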
This book is ideal for beginners and professionals who want to learn how to use Python to analyze data, draw meaningful conclusions, and make evidence-based decisions, even with little to no prior coding or statistical background.
The book is available online on the publisher’s website, and a hard copy can be purchased on Amazon.
Have any questions? Please comment below!
See you next Saturday!
Thanks,
Rami