Press "Enter" to skip to content

Curated SQL Posts

Clustering Text via Embeddings and HDBSCAN

Ivan Palomaras Carrascosa groups things together:

In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

Topics we will cover include:

  • How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.
  • How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.
  • How to apply HDBSCAN to automatically discover topic clusters and visualize the results.

This is a pretty neat trick that takes advantage of the embedding model’s ability to convert raw text into hundreds (or thousands) of floating point numbers while maintaining enough of the context to differentiate ideas. A lot of it is the original word2vec concepts but scaled up.

Leave a Comment

Alerting on Checkpoint Time in Postgres

Jeremy Schneider shares some advice:

Checkpoint is the heart of your database. It’s buried deep inside. It’s not something everyone talks about, like well-tuned autovacuum or fast queries. But if checkpointer stops beating, then you’re dead.

In addition to its well-understood job of getting dirty pages written from cache to disk in the background, it also has many smaller jobs that are less widely known. Management of a few shared-memory config settings like sync_standby_names and full_page_writes. Fsync Batching. Deferred file unlinks. Enforcement of archive_timeout.

Click through to see what happens when checkpoint time starts increasing and one important thing you should not do.

Leave a Comment

Handling Growth in sysjobhistory

Aaron Bertrand has a lot of jobs:

In the first few days of my new role at Infios, we came across an interesting case of memory exhaustion. A whole slew of memory-related error messages would populate the errorlog, then some stack and memory dumps would appear, and then the SQL Server service would just shut itself down without warning. Some of the errors we observed (apologies, it’s a long list, but I want to make sure that any subset might land you here):

Click through to see if you have any of those issues and one possibility of what the cause might be, as well as how to deal with it.

Leave a Comment

XML Support in MySQL and Postgres

Aisha Bukar lays out how XML works in a pair of relational platforms:

XML (Extensible Markup Language) may no longer dominate modern web APIs the way it once did, but it still plays a critical role in many enterprise systems. Financial institutions, publishing platforms, healthcare systems, government agencies, and large legacy applications continue to rely heavily on XML for structured data exchange and long-term interoperability.

XML also remains deeply embedded in technologies such as SOAP-based APIs, enterprise messaging systems, configuration files, and document-centric workflows where strict structure and validation are essential. This is largely because, unlike lightweight formats such as JSON, XML was designed to handle complex hierarchical documents, namespaces, schemas, and mixed content.

Read on to see how the two open-source relational database platforms handle XML data.

Leave a Comment

Optimizing Power BI Data Agents

Paul Turley shares some advice:

Amid the AI frenzy, there is a lot of conversation about how business users will use agentic chat to answer business questions rather than interactive, dashboard-style reports. Is there truly a shift in the industry, and is agentic analytics going to change the way most business users consume data?

Just how viable is the whole “chat with your data” option, and is it really a replacement for conventional reporting? I recently heard a VP-level leader at a large consulting firm say something to the effect of “we need to stop investing in dashboard-building skills and focus on creating AI-driven data analysis solutions for our consulting customers.” I’m paraphrasing from memory, but that was the sentiment. Are all business leaders across the industry giving up their dashboards, interactive visual reports and scorecards in exchange for AI chat? No. Of course they aren’t — but conversational analysis is a new way to consume business data.

Much of the advice is very similar to what you’d get for standard dashboard creation, and it makes sense. The clearer your data model is and the tighter your semantic model is, the easier it is for processes to use that semantic model. But Paul also covers some things specific to Data Agents as well.

Leave a Comment

Snapshot Testing in R

Jakub Sobolewski drills into a particular form of testing:

Snapshot testing is not about screenshots.

Most people meet it through UI regression tests: render a component, save a picture, fail the build when the picture changes. So the technique gets filed away as “the thing that compares images.” That is one use. But not the only one.

The mechanic underneath is general. Capture some output, save it to a file, and on every later run compare fresh output against the saved copy. The output can be a plot. It can also be console text, a log, a data frame, an error message, or a deeply nested list. Anything you can serialize, you can snapshot.

Read on to see how you can perform snapshot testing, using examples in R to demonstrate. H/T R-Bloggers.

Leave a Comment

Trusting Outputs from Fabric Data Agents

Jens Vestergaard says don’t trust, do verify:

In two previous posts I went down the path of getting a semantic model ready for AI: descriptions on every measure, an instructions file, the schema tidied up enough that a Fabric data agent has something real to read. That work has a satisfying endpoint. The model looks ready.

Ready is not the same as right.

The kind of evaluation Jens is talking about is fundamental to good business intelligence practices, regardless of whether you throw language models into the mix. Where language models do add complexity is the arbitrary scope of questions, how ambiguous people tend to be when writing, and the stochastic nature of answers. All of that makes the problem harder, though at least it isn’t an entirely different class of problem to solve.

Leave a Comment

What’s Common in Regular Expressions

John Cook muses on regular expression libraries:

The most frustrating aspect of regular expressions is that implementations vary. Features supported in one tool may not be supported at all in another tool, or they may be supported with slightly different syntax.

I learned regular expressions in the context Perl, a maximalist regex environment. This led to frustration when features I expect to work are missing [1]. One way around this is to use Perl analogs of other tools, but this is very non-standard. I want to be able to send colleagues and clients code that works out of the box.

Click through for some thoughts about the lowest common denominator for what products tend to support around regex. This is one of several tricky things when working with regular expressions: you may know a great way to solve a specific class of problem, but does the particular engine you’re using actually support that method?

Leave a Comment

An Introduction to Memory Grants in SQL Server

Erik Darling has a new video for us:

Erik monitoring tool mogul darling here with Darling Data. In today’s video, much like I think I foreshadowed in yesterday’s office hours video, we are going to talk about memory grants. We’re going to do a somewhat gentle introduction to them and then in the next video we’ll talk a little bit more about where they get interesting. 

Erik shares his perfectly reasonable take on the nature of strings. I’d probably also get rid of dates as well—too much confusion there for my taste.

Leave a Comment

SQL Formatting in SSMS 22.7

Chad Callihan tries out a new feature:

Code formatting can be a touchy subject. Sometimes there are clear rules to designate right and wrong, and sometimes there’s not. Tabs versus spaces, anyone?

Surprisingly, SQL Server Management Studio has never had a built-in SQL formatter. Users were always left to use third-party tools or format by hand. But with the latest SSMS 22.7, SQL formatting finally comes built-in.

Let’s look at some examples and see how it performs.

Chad also spotted a problem in the formatter as it is in that release.

Leave a Comment