In part one I covered case sensitivity of the Data Definition Language (DDL); in part two I will move on to the data itself and once again examine some best practices that have worked for me.

Before we get started, please note this article is written from the standpoint of an English-speaking developer; case sensitivity, collation, and the expected behavior are greatly affected by the choice of language, so if you are attempting to support non-Latin based languages please take this article with a grain of salt.

It turns out that case sensitivity of the data is a two part issue. We must consider both the sorting of the data and the comparison of the data. I will cover the sorting of data first, as it is fairly easy to resolve.

...

I won't bore you with a lot of technical details of the low level inner workings of string sorts; suffice it to say that collation is what controls the behavior, and if you Google that you will find more information than you could ever want. With each new release of PostgreSQL it has gotten a little more accommodating when it comes to collation, and in fact with 9.1 you can even specify collation at the field level.
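
For example, here is a quick sketch of what field-level collation looks like; the table definition is just an illustration based on the customer table used later in this article, not an actual schema:

create table customer (
    sysid    serial primary key,
    lastname varchar(50) collate "C"   -- this column sorts and compares with "C" collation regardless of the database default
);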

So what exactly is collation? It controls how characters are treated for sorting. Since PostgreSQL came from the Unix world, it uses "C" or "POSIX" as its normal method of collation, which in simple terms (and probably not 100% accurate, but close enough for our conversation) means that it sorts by the ASCII value of the characters. Lowercase "c" is 99 and uppercase "C" is 67, so an uppercase "D" (68) is going to be considered as coming before that lowercase "C". Let's look at a couple of examples. With my collation set to English_United States.1252 I get the following results.

...

However, if my database were set for "C" or "POSIX" collation I would get a completely different result. With 9.1 I can see that by forcing a "C" collation on my order by statement:

select lastname from customer order by lastname collate "C"

I get this result:

ADAMS
ADAMSON
ADams
ADamson
Adams
Adamson
BONES
BOnes
Bones
adams
adamson
bones

Not what most of us would expect at all. The good news is that it's a simple matter to create your database with a collation that works for you. In fact, the installation I am using for Windows came from EnterpriseDB and uses English_United States.1252 by default. On my CentOS 6.3 installation of PostgreSQL 9.1.4 it defaults to en_US.UTF-8 as the collation, which also sorts as I would like. The default is determined by your operating system installation, so you may see different results, but the bottom line is that when creating a database you should make sure you use a collation that will sort as you like.
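
If you want to be explicit about it, you can spell out the collation when creating the database. A sketch (the database name is made up, and the locale must be one available on your operating system):

create database salesdb
    template template0            -- template0 is required when specifying a locale different from the template's
    lc_collate = 'en_US.UTF-8'    -- controls how text sorts
    lc_ctype   = 'en_US.UTF-8';   -- controls upper/lower case and character classification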

...

In simple terms, "Adams" does not equal "adams" as far as PostgreSQL is concerned. Especially for developers moving from MS SQL Server, that is not the expected behavior; we are used to string comparisons not being case sensitive.
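
You can see this for yourself right in psql (just an illustration):

select 'Adams' = 'adams';   -- returns false
select 'Adams' = 'Adams';   -- returns true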

Before I go much further, let me climb up on my soapbox for a second. Yes, I know, I so rarely do that! Whenever I discuss this issue with developers, they always react as if it's the end of the world, and I admit I did as well when I first ran into it. But after taking a couple of deep breaths, really think about how often you do string comparisons. If you really start looking at your code, I will bet you will discover it is not as often as you thought. Even in our large system, with 340+ tables that we converted from SQL Server to PostgreSQL, we found that there just were not that many places.

We use, and I hope you do as well, auto-numbered record ids for all tables, and the links between tables are accomplished with those. Sorting isn't an issue, as I already covered that above. So it came down to a few places where we were doing filtering instead of complete comparisons, such as LastName like 'Adam%', and a few other places where we used a column to tell us which table a record id was pointing to. For example, we used a generic Phone table with two fields, LinkTable and LinkSysId, to get back to Customer, Vendor, etc. These types of fields we have control over as developers, so we can make sure they respect a certain case rule, such as all lower, all upper, etc. PostgreSQL provides us with a case insensitive like command called ilike that will perform like comparisons as we would expect. There are issues with the use of indexes, so there could still be some performance considerations, but you should at least consider and understand how much case sensitivity is really going to affect you before moving forward.
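
For example, using the customer table from earlier (a sketch):

select lastname from customer where lastname like 'Adam%';    -- only matches rows that start with exactly 'Adam'
select lastname from customer where lastname ilike 'adam%';   -- matches 'Adams', 'ADAMS', 'adamson', ...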

...

In true Uncle Pete fashion I will start with a war story.

Back in the day, we lowly PostgreSQL developers didn't have an easy way around the case sensitivity issue. To make my first PostgreSQL database work similar to how it worked on SQL Server I had to actually override the functionality of all the varchar comparison functions. It's simply amazing that we even have that ability, as it's not something you could do with most databases, and how we did it is outside the scope of this article, but in simple terms we forced everything to lower case for the comparisons. So instead of the command being

where varchar = varchar

...

it became,

where lower(varchar) = lower(varchar)

...

You could accomplish the same thing by simply coding all of your SQL statements that way, but since we had millions of lines of existing code that was not a practical solution for us, so we modified the commands themselves so that anytime varchars were compared it would happen. The problem with our solution was that indexes would not be used; however, another great feature of PostgreSQL is functional indexes. This means that we are able to create indexes that use functions, such as:

...
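
For illustration, a functional index on the hypothetical customer.lastname column from the earlier examples might look like this sketch:

create index customer_lastname_lower_idx on customer (lower(lastname));   -- index the lowercased value rather than the raw column

With that index in place, a query written as where lower(lastname) = 'adams' can use the index instead of scanning the whole table.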

So we could just stop there: I could tell you how to change the varchar comparison commands, and you could have a case insensitive database in PostgreSQL, and as long as you remembered to also create the matching functional indexes it would even perform well. I still have databases running this way today. But as is always the case with system-changing functionality like this, sooner or later you are going to run into something that causes you issues (like we did with older versions of Access and WinDev).

Update:

Warning

After discussing this with John Armstrong, we discovered that V17 of WinDev still has issues with functional indexes that contain more than one field. You cannot import the tables into the Analysis and, far worse, some of them (the combination of an integer field plus a lower(text) field, for instance) will cause a GPF when using an HAdd command.

The citext solution

Starting with PostgreSQL version 8.4 a new module is available: citext. citext does basically the same thing we were doing, but in a much cleaner fashion, by creating a new variable type. There are two steps to enabling it: first the module must be installed, and second the extension has to be created in the schema you are working in. Installing the module depends greatly on your particular distribution of operating system and PostgreSQL and is outside the scope of this document; Google should provide you with information on how to install the module. If you need help, send me a private message and I will be glad to help.

Once the module is installed, you need to issue the following command in the schema you are working in: create extension citext. By using citext as the variable type you create a column that is case insensitive for comparisons.
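
Here is a minimal sketch of putting it to use, converting the lastname column from the earlier examples (the names and the alter statement are just an illustration):

create extension citext;

alter table customer alter column lastname type citext;      -- convert the existing column to the case insensitive type

create index customer_lastname_idx on customer (lastname);   -- a plain index; it is case insensitive as well

select lastname from customer where lastname = 'adams';      -- now matches 'Adams', 'ADAMS', 'adams', ...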

...

The advantage of this approach is that the index created on the citext column will also be case insensitive, so you don't have to remember to create special indexes. Again, it's not all rainbows and unicorns though. The citext variable type does not allow you to specify a length, which does cause WinDev a few minor issues. When importing the tables, WinDev sees the field as unlimited text and creates it as a memo field, and it also warns that it is ignoring the fact that the field has an index.

...

It's a simple matter to change the field in the analysis to a standard string, set its length, and set it as a key, and once you do, WinDev will treat it as you would like. I haven't run into any issues in my initial tests.

...

Will the citext variable type cause you issues with other tools? I wouldn't be surprised if older versions of Access have issues with it, but then older versions of Access have issues with lots of things, and I am glad to say I no longer have to support Access. I haven't run into any issues in WinDev at this point, but I am in the initial testing stages of using this method in a production environment. I will of course be sure to post my results as I move forward.

Theoretically someone could create a value via SQL larger than the length you are allowing in WinDev, but there are many ways to handle that, such as using triggers to truncate the values, firing the idiot developer, etc., so I really don't see the database not having a length limit on the field as an issue.
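
If you go the trigger route, a minimal sketch might look like this, assuming a hypothetical 50 character limit on the customer.lastname column (the function and trigger names are made up for the example):

create or replace function truncate_lastname() returns trigger as $$
begin
    new.lastname := left(new.lastname, 50);   -- silently trim anything longer than the limit defined in WinDev
    return new;
end;
$$ language plpgsql;

create trigger customer_truncate_lastname
    before insert or update on customer
    for each row execute procedure truncate_lastname();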

There is a slight performance hit with citext, as any commands have to convert the text to lowercase for comparisons, but that discussion is purely academic unless you are churning through billions of rows of text data attempting to do case insensitive matching. If that is the case you will probably need to explore other options, such as full text indexes, but in a standard OLTP system you should never notice an issue.

...

Summary

So there you have it: to solve case sensitivity with PostgreSQL do the following:

  1. Take a deep breath and remember it's not as big an issue as it sounds
  2. Create your DDL in lowercase
  3. Make sure that your collation settings for the database are going to give you the desired sorting
  4. Use the citext variable type for any column that you want to have case insensitive comparisons on.
  5. As an alternative to #4, if you will only occasionally be doing it, or only in a few places, consider using ilike and where lower(varchar) = lower(varchar) instead.
  6. If it's going to be a common search, be sure to also create the appropriate index on the column; for citext that is a normal index, but if you are using ilike or lower(varchar) then you will need to create a functional index on lower(varchar)
  7. If you get stuck, take another deep breath, remember Google is your friend, and if all else fails give Uncle Pete a call.