Postgres Natural Sorting

Recently, a situation came up where some alphanumeric data I was working with needed to be sorted in a way that felt more “natural.” Normally, when you’re sorting an alphanumeric list of data, it sorts by character code points. For example,

BAC012
ABC102
CAB210
ABC103

gets sorted as

ABC102
ABC103
BAC012
CAB210

That seems right … so what do I mean by “natural?” The “natural” descriptor is meant to provide an alternative label for how the numeric part of the alphanumeric data gets ordered. When we sort numbers, it’s sometimes awkward to sort them by character code points.

That ordering isn’t wrong; but it’s usually not what we want. When sorting numbers, it’s more “natural” to sort numbers as numbers.

Naturally, numbers, by default, get treated as numbers and sorted from low to high value, but when they are specifically part of an alphanumeric value, they’re sorted according to the code point, or value tied to that character.

Postgres to the rescue

With that understanding of “natural,” the problem I was working to solve led me to learning that you can create your own collation objects in Postgres. Within that section of the documentation, I found a nifty example provided.

Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).

CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');

Apparently this was a common enough desire, that they simply give an example in the docs. It seemed like exactly what I was looking for; the only problem was that I didn’t really know what anything after numeric meant in the CREATE COLLATION clause. And I wanted to understand it!

This is where the small task of simply creating and using the collation became a rabbit hole to Wonderland.

Down the rabbit hole

To get a better understanding, I read all the previous context in the Postgres docs and I followed the links about Unicode to discover much more than I expected. One example: I thought I knew what a collation is for, but if asked, I wouldn’t have had an answer. Now I do.

In case you’re like me and collations are a fuzzy concept, hopefully I can explain by simply saying it’s used for determining the sort order of a character set. Before, I think I saw collations as something that defined a character set. But according to Unicode, it’s specifically for defining how characters in a character set should get ordered. This sounds perfect if I want to treat numbers different from non-numbers in an alphanumeric string.

ICU but do UC me?

Now that I have that title out of my system, we can start breaking down what the CREATE COLLATION clause is even saying.

But! Before I could look more into what ICU is, my brain needed to know if it was an acronym. So, for anyone with a brain like mine, I eventually found it (after more digging than I ever expected it to require). ICU is an acronym for “International Components for Unicode.”.

With that out of the way, the ICU provider gives context about how to interpret the locale value given. It’s a common Unicode syntax, and it’s also helpful for defining collations that are more than just language+country. There’s a lot about ICU, but for our purposes, the locale value is where I’ll shift focus.

Nested rabbit holes

Both of the CREATE COLLATION clauses use the same provider, but have two separate locale values that allegedly do the same thing: en-u-kn-true and en@colNumeric=yes. The Postgres docs mention that en-u-kn-true is a “’language tag’ per BCP 47” while 'en@colNumeric=yes' is the “traditional ICU-specific locale syntax.” They also mention that the language tag is preferred but isn’t supported by older ICU versions. To be entirely honest, I didn’t look further into it. If you choose the one that makes the most sense to you right now, it might be fine. YMMV.

With that said, it’s possible you’re thinking what I was: what does that mean?

en-u-kn-true

The separator can also be an underscore for backward compatibility reasons only. A dash is the preferred approach if your database supports it.

The en-u-kn-true value is known as a Unicode locale identifier. Each section is preferably separated by a -.

en: The first part of the string refers to the language this locale string is being created for. If you wanted a region, you could do en_US or en_GB, but en is also acceptable (case also seems to matter).
u: Figuring this out took some effort. Despite there being a lot written, so much of it had implicit assumptions. Eventually I tracked it down here and subsequently RFC6067 to determine that it is a Unicode locale language tag extension. By itself, it doesn’t mean much.
kn: With the u- prefix before it, I looked at more extensions under that BCP 47 specification and found kn to be the Unicode locale extension for numeric ordering here.
true: This last part simply enables the kn extension to treat any sequence of decimal digits as digits for sorting numerically.

Overall, the u-kn-true is the key section in the language tag that enables a numeric ordering within alphanumeric strings. For the second string, I’d assume it’s just a different way to say the same thing. (Spoiler alert: it is.)

en@colNumeric=yes

en: This is the same as above.
colNumeric=yes: The best place I’ve found for finding this information is in the ICU GitHub repository. Within that file, the available values are broken down along with a small note about what each do. I tried a few sources, but nothing I found went over the traditional ICU-specific syntax as clearly. Per this page, the @ performs a similar duty to u in the BCP 47 language tag by specifying what follows is an extension.

Information for determining what is right is sadly not cohesive. You really have to know what you’re looking for first. So if you’re needing to create a collation with different requirements, it may take some time to track down exactly what you want. Hopefully, I’ll have saved you some time.

Applying the collation

Now for the straightforward part! Once I had the collation created, I applied it to the columns I wanted with:

ALTER TABLE my_table ALTER COLUMN my_column TYPE character varying(255) COLLATE numeric

Originally, I assumed I would need to group numeric values together and sort more manually, but thankfully Postgres is exceptional and manages this ordering..

Kenneth Bogner

Status

Double Agent

Code Name

Agent 0094

Location

Cleveland, OH

Github

Speed up your Rails app with upsert_all

Rails' ActiveRecord is—16 years in—almost feature complete, but Rails 6 introduces `upsert_all`. Here's how it can speed up some of the slowest parts of your app.

Publish Date: April 9, 2020
Authors: Justin Searls
Categories: Ruby; Upgrades

Enumerate your enums

Rails defaults can lead you astray when creating enum attributes. This guide will show you an easier and less error-prone way that leverages custom enum types in Postgres.

Publish Date: June 3, 2019
Authors: Justin Searls
Categories: Ruby

Mongo to Postgres

This morning, I finished a production migration of a client application off of MongoDB and onto PostgreSQL. I’ll document a few things I learned along the way here, and I’ll spare you arguments about the merits of why we decided to migrate.

Publish Date: June 23, 2014
Authors: Justin Searls
Categories: Development

Is your job okay? It could be better: autonomy to solve problems the right way, dedicated time to grow your skills, and empathetic colleagues to partner on your best work.

We're 100% remote and employee-owned.

Join us