
Curated SQL Posts

A Primer on Apache Iceberg

Brendan Tierney provides an introduction to Apache Iceberg:

Modern data platforms increasingly separate compute from storage, using object stores as durable data lakes while scaling processing engines. Traditional “data lakes” built on Parquet files and Hive-style partitioning have limitations around atomicity, schema evolution, metadata scalability, and multi-engine interoperability. Apache Iceberg addresses these challenges by defining a high-performance table format with transactional guarantees, scalable metadata structures, and engine-agnostic semantics.

Apache Iceberg is an open-source table format that has become the industry standard for data sharing in modern data architectures. Let’s have a look at some of its key features, some of its limitations, and some of the alternatives.

Brendan explains where Iceberg fits in relation to data formats (e.g., Parquet, ORC, and Avro), as well as competitors like Delta Lake and Hudi.


JSONB Data in Postgres and Performance Due to TOAST

Paul Ramsey lays out the facts and the data:

Working with APIs and arrays in the jsonb type has become increasingly popular recently, and storing pieces of application data using jsonb has become a common design pattern.

But why shred a JSON object into rows and columns and then rehydrate it later to send it back to the client?

The answer is efficiency. Postgres is most efficient when working with rows and columns, and hiding data structure inside JSON makes it difficult for the engine to go as fast as it might.

Read on to learn how Postgres manages to store arbitrary-sized JSONB data within the limitations of 8KB pages, and the performance implications of doing so.
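As a rough illustration of why large JSON documents cannot sit inline on an 8KB page, the sketch below estimates how many out-of-line chunks a document might need. The constants are assumptions based on PostgreSQL's documented defaults (a threshold of about 2 kB and chunks of about 2000 bytes), and `estimate_toast_chunks` is a hypothetical helper. It measures the JSON text rather than the binary jsonb representation and ignores compression, so treat the numbers as a ballpark only.

```python
import json
import math

# Assumed defaults; real values depend on how PostgreSQL was built.
TOAST_THRESHOLD_BYTES = 2048   # roughly where TOAST handling kicks in
TOAST_CHUNK_BYTES = 2000       # approximate size of one out-of-line chunk


def estimate_toast_chunks(document):
    """Return (size_in_bytes, estimated_chunk_count) for a JSON-serializable value.

    Simplified model: uses the JSON text length, not the jsonb binary
    format, and assumes no compression is applied.
    """
    size = len(json.dumps(document).encode("utf-8"))
    if size <= TOAST_THRESHOLD_BYTES:
        return size, 0  # small enough to stay inline with the row
    return size, math.ceil(size / TOAST_CHUNK_BYTES)


small_doc = {"id": 1, "status": "ok"}
large_doc = {"tags": ["example-tag"] * 1000}

small_size, small_chunks = estimate_toast_chunks(small_doc)
large_size, large_chunks = estimate_toast_chunks(large_doc)
print(small_size, small_chunks)
print(large_size, large_chunks)
```

Every chunk beyond the first is extra data Postgres has to fetch and reassemble before it can read a single key out of the document.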


Pain Points around Direct Lake

Teo Lachev describes a pair of problems:

I’m helping an enterprise client modernize their data analytics estate. As a part of this exercise, a SSAS Multidimensional financial cube must be converted to a Power BI semantic model. The challenge is that business users ask for almost real-time BI during the forecasting period, where a change in the source forecasting system must be quickly propagated to the reporting layer, so the users don’t sit around waiting to analyze the impact. An important part of this architecture is the Fabric Direct Lake storage to eliminate the refresh latency, but it came with a couple of gotchas.

Click through for those two problems.


An Overview of the Fabric Native Execution Engine

Ankita Victor-Levi introduces a new processing model:

In today’s data landscape, as organizations scale their analytical workloads, the demand for faster, more cost-efficient computation continues to rise. Apache Spark has long been the backbone of large-scale data processing with its in-memory processing and powerful APIs, but today’s workloads demand even better performance.

Microsoft Fabric addresses this challenge with the Native Execution Engine, a vectorized, C++ powered execution layer that accelerates Spark jobs with no code changes, reduced runtime, and no additional compute cost. This blog post will take you behind the scenes to give an overview of how the engine works and how it delivers performance gains while preserving the familiar Spark developer experience users already know and love.

Read on to learn more about its capabilities and current limitations.


Building Power BI Reports from the Desktop or Fabric

James Serra clears up some confusion:

If you’re a Power BI report author who’s just getting into Microsoft Fabric, you’ve probably asked the same question I hear over and over: am I supposed to stop using Power BI Desktop now?

It’s a fair question. Power BI Desktop is a Windows app that has traditionally been the place where report authors do everything: get data, transform it, model it, and build the report. Microsoft even describes that “connect, shape/transform, then load” experience as part of how Power BI Desktop works with Power Query.

Fabric changes the feel of that workflow because Power BI is now also a first-class experience in the browser inside the Fabric portal. And that browser experience isn’t just “view and share” anymore. You can edit semantic models in the service, including using Power Query for import models and building reports directly from that same environment.

Read on to see which of the two approaches makes the most sense for a brand new report.


Connection Pooling in PostgreSQL vs SQL Server

Haripriya Naidu compares two systems:

If you speak SQL Server as your first language, then you might be aware that connections are thread-based by design. That means each session/connection in SQL Server gets a worker thread. That thread is tied to that session from start to finish of execution.
If there are no available threads, new connections wait in queue until threads become available. This is called a thread-based model.

Postgres is different: it uses a process-based model. Every single connection spawns a separate backend OS process, and each one consumes RAM (>5MB per connection).

It’s interesting that the RDBMS that really “needs” connection pooling doesn’t have it built in, whereas the one that doesn’t “need” connection pooling (but can still benefit greatly from it) does.
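To make the idea of pooling concrete, here is a minimal sketch of what a connection pool does: open a fixed number of connections up front and lend them out, rather than paying for a new backend process per request. `ConnectionPool` and `fake_connect` are hypothetical names used for illustration only, and the stand-in connection objects are not real database connections. A production setup would more likely rely on an existing pooler than on hand-rolled code like this.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Lend out a fixed set of reusable connections (illustrative sketch)."""

    def __init__(self, factory, size):
        # Open every connection once, up front.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks when all connections are in use, instead of opening more.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return the connection for reuse


opened = []


def fake_connect():
    """Hypothetical stand-in for an expensive database connect call."""
    conn = object()
    opened.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # run queries here

print(len(opened))  # ten requests were served by two connections
```

The blocking `get` is the design choice that matters: when the pool is exhausted, callers wait for a free connection, which caps the number of backend processes the database ever sees.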


Combining UNION and UNION ALL

Greg Low crosses the streams:

Until the other day though, I’d never stopped to think about what happens when you mix the two operations. I certainly wouldn’t write code like that myself but for example, without running the code (or reading further ahead yet), what would you expect the output of the following command to be? (Note: The real code read rows from a table but I’ve mocked it up with a VALUES clause to make it easier to see the outcome).

Read on to see what happens.
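For anyone who wants to experiment first, the sketch below mixes the two operators using Python's built-in `sqlite3` module. This is not Greg's example and it runs on SQLite rather than SQL Server, so it only shows how SQLite groups compound selects (left to right, per its documentation). The `run` helper and both queries are made up for illustration.

```python
import sqlite3


def run(query):
    """Run a query against a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return [row[0] for row in conn.execute(query)]
    finally:
        conn.close()


# UNION comes last, so it de-duplicates everything to its left.
dedup_last = run("SELECT 1 AS n UNION ALL SELECT 1 UNION SELECT 2")

# UNION ALL comes last, so its rows are appended after de-duplication.
append_last = run("SELECT 1 AS n UNION SELECT 2 UNION ALL SELECT 1")

print(sorted(dedup_last))
print(sorted(append_last))
```

The same three values come back as two rows or three rows depending only on the order of the operators, which is a good reason to add parentheses or avoid mixing them.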


Tracking Typing Speed with R

Tomaz Kastrun is pushing aside Mavis Beacon:

Did you ever wonder how fast and accurate your typing is?

For this instance, we will introduce some random pangrams, code samples and random strings sorted by level of difficulty.

This was kind of fun. I could hit about 80 or so WPM on the easy code examples and about 120 on the pangrams (with consistency between difficulties). Also, “Sphinx of black quartz judge my vow” is a pretty awesome thing to shout at the most opportune time.


Failure Tracking in SSIS

Andy Brownsword keeps a log:

SSIS packages provide great flexibility for integration between systems, but when they go wrong you can end up digging through logs or reports because every package logs differently. A standardised framework for tracking failures can drastically cut down troubleshooting time.

When I reminisced recently about old code, I said “it’s not enough to make it work correctly. It needs to fail correctly too”. So in this post we’ll demonstrate a simple way to consistently track errors and failures in packages to help make troubleshooting much easier.

My recollection is that this kind of failure logging is less important if you have the SSISDB catalog, as it collects a lot of the information as well. But then again, I haven’t really used SSIS in a while, so that memory could be fuzzy.


Linked Servers in SQL Server 2025 and Strict TLS

Rebecca Lewis notes a common failure point:

If you upgrade to SQL Server 2025 and your linked servers stop working, you are not alone. This is the single most common post-upgrade failure I am seeing right now, and it hits almost every environment that has linked servers configured from an older version. SQLNCLI is gone. The replacement driver has different defaults. Your connections will fail unless you explicitly tell them how to encrypt.

Read on for the correct solution, the mostly-correct solution, and the solution that a lot of people will take but that will probably burn them in a few years.
