PostgreSQL Case Sensitivity Part 2 (The Data)

In part one I covered case sensitivity of the Data Definition Language (DDL); in part two I will move on to the data itself and once again examine some best practices that have worked for me.

Before we get started, please note this article is written from the standpoint of an English-speaking developer; case sensitivity, collation, and the behavior you expect from them are greatly affected by your choice of language, so if you are attempting to support non-Latin-based languages please take this article with a grain of salt.

It turns out that case sensitivity of the data is a two-part issue: we must consider both the sorting of the data and the comparison of the data. I will cover sorting first, as it is fairly easy to resolve.

Case Sensitivity of Sorts

I won't bore you with a lot of technical details of the low-level inner workings of string sorts; suffice it to say that collation is what controls the behavior, and if you Google that term you will find more information than you could ever want. Each new release of PostgreSQL has gotten a little more accommodating when it comes to collation, and in fact with 9.1 you can even specify collation at the field level.
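
As a quick illustration of the 9.1 field-level syntax (a throwaway example; the collation names available depend on your operating system, though "C" always exists):

CREATE TABLE collation_demo
(
 lastname character varying(20) COLLATE "C"
);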

So what exactly is collation? It controls how characters are treated for sorting. Since PostgreSQL came from the Unix world, it uses "C" or "POSIX" as its normal method of collation, which in simple terms (probably not 100% accurate, but close enough for our conversation) means that it sorts by the ASCII value of the characters. Lowercase "c" is 99 and uppercase "C" is 67, so an uppercase "D" (68) is considered to come before that lowercase "c".
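
You can check those values yourself with the built-in ascii() function:

select ascii('c'), ascii('C'), ascii('D')

which returns 99, 67 and 68. Let's look at a couple of examples. With my collation set to English_United States.1252 I get the following results.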

select lastname from customer order by lastname
adams
Adams
ADams
ADAMS
adamson
Adamson
ADamson
ADAMSON
bones
Bones
BOnes
BONES

However, if my database were set to "C" or "POSIX" collation I would get a completely different result. With 9.1 I can demonstrate that by forcing a "C" collation in my order by statement:

select lastname from customer order by lastname collate "C"

I get this result:

ADAMS
ADAMSON
ADams
ADamson
Adams
Adamson
BONES
BOnes
Bones
adams
adamson
bones

Not what most of us would expect at all. The good news is that it's a simple matter to create your database with a collation that works for you. In fact, the Windows installation I am using came from EnterpriseDB and uses English_United States.1252 by default. My CentOS 6.3 installation of PostgreSQL 9.1.4 defaults to en_US.UTF-8 as the collation, which also sorts as I would like. The default is determined by your operating system installation, so you may see different results, but the bottom line is: when creating a database, make sure you use a collation that will sort as you like.
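
If your installation's default doesn't sort the way you want, you can set the collation explicitly at creation time. A minimal sketch, assuming a Linux-style locale name (on Windows you would use something like 'English_United States.1252' instead; mydb is a placeholder):

CREATE DATABASE mydb
 WITH TEMPLATE template0
 LC_COLLATE 'en_US.UTF-8'
 LC_CTYPE 'en_US.UTF-8';

Specifying template0 is required when the collation differs from the one your template database was created with.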

Case Sensitivity of Comparisons

Now for the real topic everyone has been waiting for for one and a half articles: case sensitive comparisons!!!

In simple terms, "Adams" does not equal "adams" as far as PostgreSQL is concerned. Especially for developers moving from MS SQL Server, that is not the expected behavior; we are used to string comparisons not being case sensitive.
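
Don't take my word for it; a one-line test shows the behavior:

select 'Adams' = 'adams'

returns false.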

Before I go much further, let me climb up on my soapbox for a second. Yes, I know, I so rarely do that! Whenever I discuss this issue with developers, they always react as if it's the end of the world, and I admit I did as well when I first ran into it. But after taking a couple of deep breaths, think about how often you actually do string comparisons. If you really start looking at your code, I will bet you will discover it is not as often as you thought. Even in our large system, with 340+ tables that we converted from SQL Server to PostgreSQL, we found that there just were not that many places.

We use, and I hope you do as well, auto-numbered record ids for all tables, and the links between tables are accomplished with those. Sorting isn't an issue, as I already covered above. So it came down to a few places where we were doing filtering instead of complete comparisons, such as LastName like 'Adam%', and a few other places where we used a column to tell us which table a record id was pointing to. For example, we used a generic Phone table with two fields, LinkTable and LinkSysId, to get back to Customer, Vendor, etc.

As developers we have control over these types of fields, so we can make sure they respect a certain case rule, such as all lower or all upper. PostgreSQL also provides a case insensitive version of like, called ilike, that performs like comparisons as we would expect. There are issues with the use of indexes, so there could still be some performance considerations, but you should at least consider and understand how much case sensitivity is really going to affect you before moving forward.
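
For example, the filter mentioned above only needs a change of operator to become case insensitive:

select lastname from customer where lastname ilike 'adam%'

and it will match 'adams', 'Adamson', 'ADAMS' and friends.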

I am guessing that I didn't convince you to accept case sensitivity for what it is and make no other changes. So, with a big sigh, I will move forward.

In true Uncle Pete fashion, I will start with a war story.

Back in the day, we lowly PostgreSQL developers didn't have an easy way around the case sensitivity issue. To make my first PostgreSQL database work similar to how it worked on SQL Server, I had to actually override the functionality of all the varchar comparison functions. It's simply amazing that we even have that ability, as it's not something you could do with most databases. How we did it is outside the scope of this article, but in simple terms we forced everything to lowercase for the comparisons. So instead of the command being

where varchar = varchar

it became

where lower(varchar) = lower(varchar)

You could accomplish the same thing by simply coding all of your SQL statements that way, but since we had millions of lines of existing code that was not a practical solution for us. The problem with our solution was that indexes would not be used; however, another great feature of PostgreSQL is functional indexes, which means we are able to create indexes that use functions, such as:

CREATE INDEX ixcustomer2
 ON customer
 USING btree
 (lower(lastname));
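
A query written in the same lower() style can then take advantage of that index:

select lastname from customer where lower(lastname) = lower('Adams')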

Since we now had an index that matched our comparisons, the index would be used and there were no performance issues. But life still wasn't all rainbows and unicorns. Since functional indexes are not seen in many RDBMS systems, many third party tools don't handle them well. We had issues with older versions of Access importing table definitions that had functional indexes. We also had trouble working with tables with functional indexes in WinDev. I am happy to say that in my testing, version 17 of WinDev seems to have corrected its issues with functional indexes, and everything seems to work fine.

So we could just stop there: I could tell you how to change the varchar comparison commands, and you could have a case insensitive database in PostgreSQL, and as long as you remembered to also create the matching functional indexes it would even perform well. I still have databases running this way today. But as is always the case with system-changing functionality like this, sooner or later you are going to run into something that causes you issues (like we did with older versions of Access and WinDev).

After discussing this with John Armstrong, we discovered that V17 of WinDev still has issues with functional indexes containing more than one field. You cannot import the tables into the Analysis, and far worse, some of them (the combination of an integer field plus a lower(text) field, for instance) will cause a GPF when using an HAdd command.

The citext solution

Starting with PostgreSQL version 8.4, a new module is available: citext. citext does basically the same thing we were doing, but in a much cleaner fashion, by creating a new variable type. There are two steps to enabling it. First the module must be installed, and second the extension has to be created in the schema you are working in. Installing the module depends greatly on your particular operating system distribution and version of PostgreSQL and is outside the scope of this document; Google should provide you information on how to install the module. If you need help, send me a private message and I will be glad to help.

Once the module is installed, you need to issue the following command in the schema you are working in: create extension citext. By using citext as the variable type, you create a column that is case insensitive for comparisons.

CREATE TABLE customer2
(
 custsysid serial NOT NULL,
 firstname character varying(20),
 lastname citext,
 CONSTRAINT pkcustomer2 PRIMARY KEY (custsysid)
)
WITH (
 OIDS=FALSE
);
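
A quick illustration, using a sample row (the insert is just test data):

insert into customer2 (firstname, lastname) values ('John', 'Adams');
select custsysid, lastname from customer2 where lastname = 'ADAMS';

The select finds the row even though the case doesn't match.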

The advantage of this approach is that the index created on the citext column will also be case insensitive, so you don't have to remember to create special indexes. Again, it's not all rainbows and unicorns, though. The citext variable type does not allow you to specify a length, and this causes WinDev a few minor issues: when importing the tables, WinDev sees the field as unlimited text and creates it as a memo field, and it also warns that it is ignoring the fact that the field has an index.

It's a simple matter to change the field in the analysis to a standard string, set its length, and set it as a key, and once you do, WinDev will treat it as you would like. I haven't run into any issues in my initial tests.

Will the citext variable type cause you issues with other tools? I wouldn't be surprised if older versions of Access have issues with it, but then older versions of Access have issues with lots of things, and I am glad to say I no longer have to support Access. I haven't run into any issues in WinDev at this point, but I am in the initial testing stages of using this method in a production environment. I will of course be sure to post my results as I move forward.

Theoretically someone could create a value via SQL larger than the value you are allowing in WinDev, but there are many ways to handle that, such as using triggers to truncate the values, firing the idiot developer, etc. So I really don't see the database not having a length limit on the field as an issue.
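
For the curious, a trigger along those lines might look like this. This is a minimal sketch, not production code; it assumes a 20 character limit to match firstname in customer2, and trim_lastname is a name I made up (left() requires 9.1 or later):

CREATE OR REPLACE FUNCTION trim_lastname() RETURNS trigger AS $$
BEGIN
 -- silently truncate anything longer than the limit the client expects
 NEW.lastname := left(NEW.lastname, 20);
 RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_trim_lastname
 BEFORE INSERT OR UPDATE ON customer2
 FOR EACH ROW EXECUTE PROCEDURE trim_lastname();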

There is a slight performance hit with citext, as any commands have to convert the text to lowercase for comparisons, but that discussion is really purely academic unless you are churning through billions of rows of text data attempting to do case insensitive matching. If that is the case, you will probably need to explore other options, such as full text indexes, but in a standard OLTP system you should never notice an issue.

If you need to perform both case sensitive and case insensitive comparisons on the same field, although for the life of me I can't imagine why, then citext would not be a solution for you.

Relatively speaking, citext is fairly new, and there have been a few issues with it in previous releases of PostgreSQL, but most of them seem to be corrected in version 9.1, and so far the release notes for 9.2 don't indicate anything related to citext needing to be fixed.

Summary

So there you have it: to solve case sensitivity with PostgreSQL do the following:

  1. Take a deep breath and remember it's not as big an issue as it sounds
  2. Create your DDL in lowercase
  3. Make sure that your collation settings for the database are going to give you the desired sorting
  4. Use the citext variable type for any column that you want to have case insensitive comparisons on.
  5. As an alternative to #4, if you will only occasionally be doing it, or only in a few places, consider using ilike and where lower(varchar) = lower(varchar) instead.
  6. If it's going to be a common search, be sure to also create the appropriate index on the column; for citext that is a normal index, but if you are using ilike or lower(varchar) then you will need a functional index on lower(varchar).
  7. If you get stuck, take another deep breath, remember Google is your friend, and if all else fails give Uncle Pete a call.