Press "Enter" to skip to content

Curated SQL Posts

Generating Local Text Embeddings in SQL Server 2025

Greg Low continues a series on local text embeddings:

In the first article of this series, I explained how to install and configure Ollama to host text embeddings models locally. I also demonstrated how to install Caddy as a proxy to allow SQL Server to use Ollama via https-based calls. In this article, I’ll show you how to make use of this at the SQL Server end.

Greg mentions a few embedding models, but the one I’m pushing right now is Microsoft Research’s Harrier OSS v1 model, specifically the 600m parameter version. It does extremely well in the MTEB leaderboards (the 27b variant is at the top of the board as of June 2026) and has a permissive license. It generates vectors in 1024 dimensions, so the embeddings are a bit chunky, taking up 4kb apiece. But the results are really good.

Leave a Comment

Bloom Filters with Valkey

Jay Miller checks for a record:

A bloom filter is a small, probabilistic data structure designed to answer one question: “Have I seen this item before?” It provides two potential answers: Absolutely Not, and Probably.

You may think that 100% Yes or No would be better but here’s the thing, probably is really fast and you’re really concerned about the Absolutely Not’s taking up unnecessary connections.

The article speaks to Aiven’s implementation of a Bloom filter in Valkey, but does get into some neat details on bloom filters in general. And if you want to go further down that route, Paul White explains how SQL Server uses Bloom filters.

Leave a Comment

Testing SQL Code in Python

Jamal Hansen writes some tests:

I once had a query that ran fine for months. Then someone added a column to the source table and a SELECT * downstream started returning unexpected data. The query didn’t error. It just silently gave wrong results. A test would have caught it immediately.

Schema changes break queries silently. Refactoring a CTE can shift results in ways you don’t notice. New data patterns expose assumptions you didn’t know you made. SQL deserves the same testing discipline as the rest of your code, and Python makes it straightforward.

PyTest, the library Jamal uses here, is one of my favorites for this kind of work. You can build up tests without a lot of ceremony and it’s pretty easy to deal with for most use cases.

Leave a Comment

The Pain of Moving Indexes between Filegroups

Erik Darling explains a process:

At some point you’re going to want to move some indexes to a new filegroup. Maybe you’re separating data across storage, maybe you’re cleaning up after someone who put everything on PRIMARY and walked away, maybe you’ve got your reasons and they’re none of my business.

Whatever the cause, you’d think this would be a solved problem in a database that’s been around since the Clinton administration.

It is not.

Some days, I’m convinced that the only way to win is not to play at all. Erik explaining how to migrate LOB data across filegroups fits that bill perfectly.

Leave a Comment

A Primer on Apache Iceberg

Walt Riberio lays out an explanation:

Data lakes had a reputation problem. The promise was compelling: dump all your data into cheap object storage—S3, GCS, Azure Blob—and query it whenever you need. The reality was a mess of stale partitions, schema drift, and silent data corruption caused by unsafe concurrent writes. Engineers knew the risks and worked around them rather than fixing them.

Apache Iceberg was built to fix that. And it’s catching on fast—even in the Kafka® world Iceberg is a hot topic. If you’re running ClickHouse® or building a pipeline that feeds into it, Iceberg is quickly becoming hard to ignore if you care about accuracy at scale. It’s becoming the connective tissue of the modern data stack.

Click through to learn more about the product and where it potentially fits in an organization.

Leave a Comment

The Ulam Prime Spiral

Tomaz Kastrun re-creates a classic:

Stanislaw Ulam, Los Alamos, 1963 was bored in a meeting and he started dooddling integers in a spiral and circled the primes. Diagonal lines appeared. He later showed it to Martin Gardner, to Ulam surprise, Gardner published his findings in Scientific American. We are still confused to this day.

Click through for a demonstration of this in action.

Leave a Comment

What pg_stat_statements Cannot Do

Radim Marek enumerates a list:

Part one made the core case: pg_stat_statements counts, it doesn’t record. It walked through how the queryid jumble fragments one logical query into many rows, how the first-seen text freezes your per-request tags, and how the averages bury the p99 that actually pages you. All of that was about data the extension has and distorts.

This part is about the rest: the entries it silently throws away, the query text that can vanish all at once, the plans and replicas it never records, and the knobs that bite. It ends where part one started, with the question the whole investigation was really about: is this the query store Postgres is missing, or just the floor you’d build one on?

Click through for those limitations.

Leave a Comment

The Limits of Parallelism with Fabric Dataflows Gen2

Chris Webb notes that parallelism does not mean “free performance improvements”:

To finish off my series of posts on concurrent evaluation in Fabric Dataflows Gen2 (see part 1 and part 2) I decided to do some more realistic tests to see how much parallelism I could get. To do this I uploaded 244 identical Excel files containing almost 542000 rows of data each to a SharePoint document library. Excel files are probably the worst-performing file format for dataflows (see here for some tests that show this), while SharePoint is probably the worst-performing place to store data for a dataflow and also has a reputation for throttling applications that make too many requests.

Click through for Chris’s test results.

Leave a Comment

Optional Parameter Plan Optimization

Hugo Kornelis concludes a mini-series:

It’s time to finish the triptych on bad parameter sniffing, and how Microsoft tries (and fails) to fix this for you.

After first talking about bad parameter sniffing in general, I used my last video to explain Parameter Sensitive Plan Optimization, the feature that Microsoft released in SQL Server 2022 as an attempt to fix one of the three root causes for bad parameter sniffing. If you want to see why I consider that a failed attempt, go watch that video.

In SQL Server 2025, Microsoft then added Optional Parameter Plan Optimization (OPPO). Their attempt to fix the second of those three root causes. Put where PSPO in most cases simply doesn’t do anything, or does do something, but not enough, OPPO does too much. It fixes bad parameter sniffing due to optional parameters in demos that are carefully curated to showcase the feature, but it does more bad than good in more realistic scenarios.

Read on for Hugo’s overview, as well as a video.

Leave a Comment