Press "Enter" to skip to content

Curated SQL Posts

A Primer on Apache Iceberg

Walt Riberio lays out an explanation:

Data lakes had a reputation problem. The promise was compelling: dump all your data into cheap object storage—S3, GCS, Azure Blob—and query it whenever you need. The reality was a mess of stale partitions, schema drift, and silent data corruption caused by unsafe concurrent writes. Engineers knew the risks and worked around them rather than fixing them.

Apache Iceberg was built to fix that. And it’s catching on fast—even in the Kafka® world Iceberg is a hot topic. If you’re running ClickHouse® or building a pipeline that feeds into it, Iceberg is quickly becoming hard to ignore if you care about accuracy at scale. It’s becoming the connective tissue of the modern data stack.

Click through to learn more about the product and where it potentially fits in an organization.

Leave a Comment

The Ulam Prime Spiral

Tomaz Kastrun re-creates a classic:

Stanislaw Ulam, Los Alamos, 1963 was bored in a meeting and he started dooddling integers in a spiral and circled the primes. Diagonal lines appeared. He later showed it to Martin Gardner, to Ulam surprise, Gardner published his findings in Scientific American. We are still confused to this day.

Click through for a demonstration of this in action.

Leave a Comment

What pg_stat_statements Cannot Do

Radim Marek enumerates a list:

Part one made the core case: pg_stat_statements counts, it doesn’t record. It walked through how the queryid jumble fragments one logical query into many rows, how the first-seen text freezes your per-request tags, and how the averages bury the p99 that actually pages you. All of that was about data the extension has and distorts.

This part is about the rest: the entries it silently throws away, the query text that can vanish all at once, the plans and replicas it never records, and the knobs that bite. It ends where part one started, with the question the whole investigation was really about: is this the query store Postgres is missing, or just the floor you’d build one on?

Click through for those limitations.

Leave a Comment

The Limits of Parallelism with Fabric Dataflows Gen2

Chris Webb notes that parallelism does not mean “free performance improvements”:

To finish off my series of posts on concurrent evaluation in Fabric Dataflows Gen2 (see part 1 and part 2) I decided to do some more realistic tests to see how much parallelism I could get. To do this I uploaded 244 identical Excel files containing almost 542000 rows of data each to a SharePoint document library. Excel files are probably the worst-performing file format for dataflows (see here for some tests that show this), while SharePoint is probably the worst-performing place to store data for a dataflow and also has a reputation for throttling applications that make too many requests.

Click through for Chris’s test results.

Leave a Comment

Optional Parameter Plan Optimization

Hugo Kornelis concludes a mini-series:

It’s time to finish the triptych on bad parameter sniffing, and how Microsoft tries (and fails) to fix this for you.

After first talking about bad parameter sniffing in general, I used my last video to explain Parameter Sensitive Plan Optimization, the feature that Microsoft released in SQL Server 2022 as an attempt to fix one of the three root causes for bad parameter sniffing. If you want to see why I consider that a failed attempt, go watch that video.

In SQL Server 2025, Microsoft then added Optional Parameter Plan Optimization (OPPO). Their attempt to fix the second of those three root causes. Put where PSPO in most cases simply doesn’t do anything, or does do something, but not enough, OPPO does too much. It fixes bad parameter sniffing due to optional parameters in demos that are carefully curated to showcase the feature, but it does more bad than good in more realistic scenarios.

Read on for Hugo’s overview, as well as a video.

Leave a Comment

Vector Search with Oracle against Iceberg Tables

Brendan Tierney performs a search:

In my previous blog posts I’ve explored how to use Iceberg Tables and how to integrate these in with your Data Lake. Additionally, I showed how to setup your Oracle Data Lake (Database) to access the data stored in Iceberg Tables stored in OCI Object Storage. To access this Iceberg Table data from the Oracle Database we created an External Table. This allows us to query the Iceberg Table data as if it was internal to the database. With all new releases there is continuous improvement in the features and to make them easier to use. One such new feature (as of 23.26.1) is the ability to read vector data types from an External Table. This new feature is called or referred to as ‘Vectors on Ice’.

With Oracle Database External Tables now supporting vector embedding stored in Iceberg Tables, means you can generate vector embeddings with your preferred embedding model (external to Oracle using your faviourite tool/library), store them in Iceberg Tables in cloud object storage (OCI Object Storage, AWS S3, etc.), and run semantic search from Oracle AI Database, accessing vector data stored within the database and externally with the minimum of data movement and with similar SQL queries.

Click through to see how.

Leave a Comment

Working with Indexed Views

Erik Darling talks indexed views:

Anyway. T-SQL Server Management Studio. When most people think about index views, they rightfully think about all the stuff they can’t do with them.

And I sympathize with that because, man, so many times they’ve been like, oh, if only you could do this, if only you could do that, it sure would be nice. And I realize that all the air has gone out of the room as far as making index views more powerful because everyone’s like, well, you could just use batch mode.

Click through for some tips around update operations when dealing with indexed views, as well as a side rant about merge joins.

Leave a Comment

Fixing Blob Lease Issues on BACKUP TO URL

Greg Low fixes a problem:

Regardless of why you are using BACKUP TO URL, one of the problems that you are likely to run into at some point is the dreaded:

Msg 3271, Level 16, State 1, Line 60
A nonrecoverable I/O error occurred on file
“ https://somestorageaccount.blob.core.windows.net/backups/somedatabase.bak : “
Backup to URL received an exception from the remote endpoint. Exception Message: The remote server returned an error: (412) There is currently a lease on the blob and no lease ID was specified in the request…
Msg 3013, Level 16, State 1, Line 60
BACKUP DATABASE is terminating abnormally.

Click through to see when you might get this problem and how you can fix it.

Leave a Comment

Accessing Data between Private Link and Public Fabric Workspaces

Gilbert Quevauvilliers uses the private endpoint:

In this blog post I show how it is possible to access data between a Private Link Workspace, where I want to read the data from my Public Workspace.

An example of this is where I wanted to use a DirectLake Semantic Model sitting in my Public Workspace where the data is from data in my Private Link Workspace.

Click through to see how it works.

Leave a Comment

Recommendations for Purview Data Governance

James Serra provides some recommendations:

Microsoft Purview can be the best data governance tool in the world, but it will still be useless if people do not know it exists, do not trust the metadata, or do not change the way they work. That is the part that often gets missed. We sometimes think that buying or implementing a governance tool means governance is now “done.” I wish it worked that way. I really do. But the reality is that Purview can automate a lot, but it cannot magically fix missing metadata, undocumented business definitions, manual data movement, duplicate datasets, or people who keep building new reports without first checking whether the data already exists.

This blog is about best practices for Microsoft Purview data governance, not the data security and compliance side of Purview. I covered the broader value of Purview data governance in my post, Microsoft Purview: The key benefits of data governance, but here I want to get more practical and a bit more opinionated. Microsoft Purview has many capabilities across governance, risk, compliance, and security, but this post focuses on the governance experience: cataloging data, improving metadata, helping users find trusted data, understanding lineage, organizing data products, and making data easier to use. Microsoft also has helpful guidance, including data governance planningUnified Catalog planningPurview deployment best practicesdeployment checklist, and getting started but the real lesson is this: the tool is only as good as the operating model around it.

Pricing jokes aside, Purview is a very powerful products and there’s a lot of sound advice from James in this post.

Leave a Comment