Recently, a situation came up where some alphanumeric data I was working with needed to be sorted in a way that felt more “natural.” Normally, when you’re sorting an alphanumeric list of data, it sorts by character code points. For example,
BAC012 ABC102 CAB210 ABC103
gets sorted as
ABC102 ABC103 BAC012 CAB210
That seems right … so what do I mean by “natural?” The “natural” descriptor is meant to provide an alternative label for how the numeric part of the alphanumeric data gets ordered. When we sort numbers, it’s sometimes awkward to sort them by character code points.
101 11 2 2003 31
That ordering isn’t wrong; but it’s usually not what we want. When sorting numbers, it’s more “natural” to sort numbers as numbers.
2 11 31 101 2003
Naturally, numbers, by default, get treated as numbers and sorted from low to high value, but when they are specifically part of an alphanumeric value, they’re sorted according to the code point, or value tied to that character.
With that understanding of “natural,” the problem I was working to solve led me to learning that you can create your own collation objects in Postgres. Within that section of the documentation, I found a nifty example provided.
Numeric ordering, sorts sequences of digits by their numeric value, for example: A-21 < A-123 (also known as natural sort).
CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true'); CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
Apparently this was a common enough desire, that they simply give an example in the docs. It seemed like exactly what I was looking for; the only problem was that I didn’t really know what anything after
numeric meant in the
CREATE COLLATION clause. And I wanted to understand it!
This is where the small task of simply creating and using the collation became a rabbit hole to Wonderland.
To get a better understanding, I read all the previous context in the Postgres docs and I followed the links about Unicode to discover much more than I expected. One example: I thought I knew what a collation is for, but if asked, I wouldn’t have had an answer. Now I do.
In case you’re like me and collations are a fuzzy concept, hopefully I can explain by simply saying it’s used for determining the sort order of a character set. Before, I think I saw collations as something that defined a character set. But according to Unicode, it’s specifically for defining how characters in a character set should get ordered. This sounds perfect if I want to treat numbers different from non-numbers in an alphanumeric string.
Now that I have that title out of my system, we can start breaking down what the
CREATE COLLATION clause is even saying.
But! Before I could look more into what ICU is, my brain needed to know if it was an acronym. So, for anyone with a brain like mine, I eventually found it (after more digging than I ever expected it to require). ICU is an acronym for “International Components for Unicode.”.
With that out of the way, the ICU
provider gives context about how to interpret the
locale value given. It’s a common Unicode syntax, and it’s also helpful for defining collations that are more than just language+country. There’s a lot about ICU, but for our purposes, the
locale value is where I’ll shift focus.
Both of the
CREATE COLLATION clauses use the same provider, but have two separate locale values that allegedly do the same thing:
en@colNumeric=yes. The Postgres docs mention that
en-u-kn-true is a “’language tag’ per BCP 47” while
'en@colNumeric=yes' is the “traditional ICU-specific locale syntax.” They also mention that the language tag is preferred but isn’t supported by older ICU versions. To be entirely honest, I didn’t look further into it. If you choose the one that makes the most sense to you right now, it might be fine. YMMV.
With that said, it’s possible you’re thinking what I was: what does that mean?
The separator can also be an underscore for backward compatibility reasons only. A dash is the preferred approach if your database supports it.
en-u-kn-true value is known as a Unicode locale identifier. Each section is preferably separated by a
en: The first part of the string refers to the language this locale string is being created for. If you wanted a region, you could do
enis also acceptable (case also seems to matter).
u: Figuring this out took some effort. Despite there being a lot written, so much of it had implicit assumptions. Eventually I tracked it down here and subsequently RFC6067 to determine that it is a Unicode locale language tag extension. By itself, it doesn’t mean much.
kn: With the
u-prefix before it, I looked at more extensions under that BCP 47 specification and found
knto be the Unicode locale extension for numeric ordering here.
true: This last part simply enables the
knextension to treat any sequence of decimal digits as digits for sorting numerically.
u-kn-true is the key section in the language tag that enables a numeric ordering within alphanumeric strings. For the second string, I’d assume it’s just a different way to say the same thing. (Spoiler alert: it is.)
en: This is the same as above.
colNumeric=yes: The best place I’ve found for finding this information is in the ICU GitHub repository. Within that file, the available values are broken down along with a small note about what each do. I tried a few sources, but nothing I found went over the traditional ICU-specific syntax as clearly. Per this page, the
@performs a similar duty to
uin the BCP 47 language tag by specifying what follows is an extension.
Information for determining what is right is sadly not cohesive. You really have to know what you’re looking for first. So if you’re needing to create a collation with different requirements, it may take some time to track down exactly what you want. Hopefully, I’ll have saved you some time.
Now for the straightforward part! Once I had the collation created, I applied it to the columns I wanted with:
ALTER TABLE my_table ALTER COLUMN my_column TYPE character varying(255) COLLATE numeric
Originally, I assumed I would need to group numeric values together and sort more manually, but thankfully Postgres is exceptional and manages this ordering..