Press "Enter" to skip to content

Curated SQL Posts

TheseusPlot 0.3.0 Released

Koji Makiyama announces an update to an R package:

TheseusPlot is an R package that decomposes differences in a rate metric between two groups into subgroup-level contributions and visualizes the results as a “Theseus Plot”.

For example, when a click-through rate, conversion rate, or retention rate differs between two time periods or groups, TheseusPlot helps answer questions such as: which subgroup contributed most to the difference?

I love the name and I think the plot concept is interesting, especially in the e-commerce context that Koji describes. H/T R-Bloggers.

Leave a Comment

Downtime Woes and Microsoft Fabric

Joey D’Antoni describes a problem:

All that being said, customers buy online services and expect them to be available. One of the reasons a company chooses Fabric, Databricks, or Snowflake is the notion that those platforms for Spark and various data warehousing options will be secured, patched, and better maintained than a non-technology company could do by simply deploying Spark into Kubernetes or VMware. With that, the cloud providers have an obligation to deliver services to their customers, and deliver availability and performance congruent with their pricing.

One of the things I expect from a cloud provider is honest post-mortems when they have an outage, and maintaining a history of their outages. These histories help architects better design systems, as we can better identify weaknesses in various cloud services that we might want to design around. Azure and AWS both do an excellent job of providing detailed information around “what happened” in incidents.

But as Joey mentions, Microsoft Fabric doesn’t have the same information. It also doesn’t have a dedicated SLA. But it does have a series of outages over the past month.

Leave a Comment

Tables without Histograms

Guy Glantser goes looking for stats:

A few weeks ago, I published a blog post titled “When Statistics Stay Empty Forever: A Hidden Edge Case with Empty Tables”

In that post, I described a scenario where statistics are created on an empty table, resulting in a statistics object without a histogram. If AUTO_UPDATE_STATISTICS is disabled and your maintenance solution doesn’t explicitly update statistics whose modification_counter is NULL, those statistics may never get a histogram – even after the table grows to millions or billions of rows.

Recently, I ran into another case involving missing histograms.

In this case, Guy found a table with 7 million rows and no histograms. Click through to see how that’s possible.

Leave a Comment

To Shrink, First You Must Grow

Eitan Blumin gives us a riddle:

Ever ran DBCC SHRINKFILE, watched it finish without a single error… and then discovered that your file is exactly the same size as before? You run it again. Same thing. And again. Nothing. Meanwhile you know there are gigabytes of empty space in there, taunting you. Welcome to one of the most maddening little corners of SQL Server.

Read on to understand why this happens and what you can do to fix it.

Leave a Comment

In Defense of DirectQuery

Chris Webb lays out an argument:

For as long as I’ve been using Power BI – which has been from the beginning – the advice about which storage mode to choose has been the same: use Import mode unless you have a really, really good reason to use DirectQuery mode and even then you’re probably wrong and should use Import mode. Import mode was always a lot faster and a lot easier to tune. Marco’s advice in this LinkedIn post from last year pretty much summed up my attitude and that of every other Power BI expert out there:

Click through for Chris’s argument. As a side note, I saw Conor Cunningham present a session this past weekend at SQL Saturday Austin on the GPU acceleration that Chris mentions in the Fabric Data Warehouse, and under certain circumstances (think “classic Kimball model”), it results in a 2-10x performance improvement, with the expectation somewhere around 3-5x. In other words, a query that takes 50 seconds to run and makes sense to run on GPU drops down to about 10 seconds with no development effort. For well-designed semantic models, this might be a significant factor that pushes the trade-off rubric in the direction of DirectQuery versus Import or DirectLake.

Leave a Comment

Invoke-Sqlcmd and Application vs Host Name

Greg Low notes a point of confusion:

I’ve always been a fan of having applications identify themselves in their connection strings. It makes tasks like tracing much easier. The tools supplied with SQL Server do a reasonable job of that as you can see above. But many other tools don’t do such a good job.

I was working at a site where they were using Powershell to execute commands for monitoring. I noticed that the Powershell commands did not set the Application Name in the connection string when using Invoke-Sqlcmd.

Read on to learn what Greg’s research discovered.

Leave a Comment

Clustering Text via Embeddings and HDBSCAN

Ivan Palomaras Carrascosa groups things together:

In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

Topics we will cover include:

  • How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.
  • How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.
  • How to apply HDBSCAN to automatically discover topic clusters and visualize the results.

This is a pretty neat trick that takes advantage of the embedding model’s ability to convert raw text into hundreds (or thousands) of floating point numbers while maintaining enough of the context to differentiate ideas. A lot of it is the original word2vec concepts but scaled up.

Leave a Comment

Alerting on Checkpoint Time in Postgres

Jeremy Schneider shares some advice:

Checkpoint is the heart of your database. It’s buried deep inside. It’s not something everyone talks about, like well-tuned autovacuum or fast queries. But if checkpointer stops beating, then you’re dead.

In addition to its well-understood job of getting dirty pages written from cache to disk in the background, it also has many smaller jobs that are less widely known. Management of a few shared-memory config settings like sync_standby_names and full_page_writes. Fsync Batching. Deferred file unlinks. Enforcement of archive_timeout.

Click through to see what happens when checkpoint time starts increasing and one important thing you should not do.

Leave a Comment

Handling Growth in sysjobhistory

Aaron Bertrand has a lot of jobs:

In the first few days of my new role at Infios, we came across an interesting case of memory exhaustion. A whole slew of memory-related error messages would populate the errorlog, then some stack and memory dumps would appear, and then the SQL Server service would just shut itself down without warning. Some of the errors we observed (apologies, it’s a long list, but I want to make sure that any subset might land you here):

Click through to see if you have any of those issues and one possibility of what the cause might be, as well as how to deal with it.

Leave a Comment

XML Support in MySQL and Postgres

Aisha Bukar lays out how XML works in a pair of relational platforms:

XML (Extensible Markup Language) may no longer dominate modern web APIs the way it once did, but it still plays a critical role in many enterprise systems. Financial institutions, publishing platforms, healthcare systems, government agencies, and large legacy applications continue to rely heavily on XML for structured data exchange and long-term interoperability.

XML also remains deeply embedded in technologies such as SOAP-based APIs, enterprise messaging systems, configuration files, and document-centric workflows where strict structure and validation are essential. This is largely because, unlike lightweight formats such as JSON, XML was designed to handle complex hierarchical documents, namespaces, schemas, and mixed content.

Read on to see how the two open-source relational database platforms handle XML data.

Leave a Comment