Press "Enter" to skip to content

Curated SQL Posts

Comparing Techniques for Text Featurization in Classification Problems

Ivan Palomaras Carrascosa tries a few things:

In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Click through for results. Granted, the specific embedding model can alter the quality of results, but even so, I do enjoy the comparison of techniques and the reminder that neural networks aren’t the ultimate solution to everything.

Leave a Comment

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalences of weak convergence that still remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self contained manner, by first presenting the relevant theorems applied (without proving them) along with along with a visual on the implication cycle created for the proof and some discussion about other presentations available in popular textbooks and some historical notes.

Click through for the PDF.

Leave a Comment

Web Scraping with Python

Jason Yousef has a script:

Below is a production-friendly pattern that:

  • Uses a requests.Session with retries, backoff, and a real User-Agent
  • Sets sane timeouts and handles common HTTP errors
  • Respects robots.txt (and tells you if scraping is disallowed)
  • Parses only mailto: links by default to avoid scraping personal data you shouldn’t
  • Handles pagination with a “Next” link when present
  • Exports to CSV
  • Can be run from the command line with arguments

Click through for the code, some explanation of how it works, and a few tips.

Leave a Comment

Running Totals over Arbitrary Date Ranges

Louis Davidson solves an interval problem:

Say you want to find the most recent 30-day period during which a person purchased some amount of products from your company. How you market to a customer might change if they have been active over a time period recently, or even in the past. But this also means that for each day going back in history, you need to sum historic data over and over, and the previous 29 days of activity. This is generally known as a rolling total. Doing this sort of calculation has been an interesting problem for many years.

When window functions came around, they became quite useful for such tasks, but they have one kind of complicated problem: gaps in source data patterns.

Funnily enough, there is a solution using window functions: range intervals. The ANSI SQL definition for RANGE (versus ROWS) for window functions does allow for the specification of a date range, like RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW. Very impressive.

Unfortunately, SQL Server doesn’t support these. PostgreSQL does, but it’s an area I’ve agitated about for a few years and I do hope that someday, the SQL Server product team will support this functionality. In the meantime, Louis has a solution that works well for the task.

Leave a Comment

Read Efficiency in PostgreSQL Queries

Michael Christofides explains what’s happening under the covers:

A lot of the time in database land, our queries are I/O constrained. As such, performance work often involves reducing the number of page reads. Indexes are a prime example, but they don’t solve every issue (a couple of which we’ll now explore).

The way Postgres handles consistency while serving concurrent queries is by maintaining multiple row versions in both the main part of a table (the “heap”) as well as in the indexes (docs). Old row versions take up space, at least until they are no longer needed, and the space can be reused. This extra space is commonly referred to as “bloat”. Below we’ll look into both heap bloat and index bloat, how they can affect query performance, and what you can do to both prevent and respond to issues.

Read on for a detailed explanation.

Leave a Comment

Using Database Properties to Assist Generative AI Solutions

Brent Ozar makes use of extended properties:

You can add database instructions as extended properties at the database or object level, and when Copilot works with those objects, it’ll read your instructions and use them to shape its advice.

For example, you can add a database-level property called a “constitution” with your company’s coding standards, like this:

Andy Brownsword has another example:

The new Database Instructions are text stored against database objects to add more context about the object and how it should be used. A simple example:

We find a Sales table with a Price column. Is that the price for a single unit or the line total? Does that include or exclude VAT? What about discounts?

This is where context is king, and Database Instructions allow us to annotate these details and remove the ambiguity.

Database properties are a criminally underused part of SQL Server—in part because there wasn’t great tooling around how to display or work with these properties—and if this forces people to be a bit thoughtful in design and after-the-fact documentation on database objects, so much the better.

Leave a Comment

Code Pages in PostgreSQL on Windows

Kellyn Gorman tells a story:

Running PostgreSQL on Windows feels deceptively simple for anyone with a Windows laptop that just wants a local database to test or demo on.  Just a few clicks and I’ve installed it, started the service, open psql and I’m up and running.
Except… not quite.
Because if you’ve ever seen this message:

Console code page set to 1252 for psql compatibility

You’ve stepped into one of the more subtle, frustrating challenges of running PostgreSQL on Windows. So, let’s talk about why this happens, what it is and why it matters more than you might think…

This has several downstream problems, as Kellyn points out. Read on to see how you can fix the issue.

Leave a Comment

When Warehouse Beats Lakehouse

Gilbert Quevauvilliers runs a test:

After my previous blog post on the different semantic model options and at the same time working with a Fabric customer, it got me thinking which is faster and which consumes less capacity when ingesting data into Power BI either via the SQL Endpoint to a Lakehouse or a query from the Warehouse.

Below you will find the information which I found very interesting indeed.

For both the Lakehouse and Warehouse source CSV’s there was a total of 237,245,585 rows.

Click through for the numbers, and a scenario in which the warehouse loads data faster than a lakehouse.

Leave a Comment

Load Testing Microsoft Fabric Redux

Reitse Eskens goes back to the well:

For those of you who have been following this blog for a long time, you may have read the posts on Fabric where I’m comparing the F64 Trial with the F2, and other shenanigans. Because Fabric keeps evolving, and new releases keep coming that improve or change the behaviour, I felt it was only fair to give Fabric a new run for its capacities.

The idea is not to create a solution that works as quickly as possible. It’s not the goal to tune Fabric, nor to get the most excitement for your Euro. The main goal of this blog (and the session that I’m presenting on this topic), is to show you the differences, the error messages and where to look when you get lost. Because, for all its intents and purposes, error handling is still tricky, and it seems to be very hard to get rid of “Something went wrong” messages.

I appreciate Reitse’s localization of the well-known phrase “most win for your Yen.”

Click through for plenty of graphs and lots of testing.

Leave a Comment