Generating Identities

Last Updated 10 years ago

The only thing you ever need to use for database identity is an IDENTITY, right? Well, maybe. There are a lot of different options and they all have different pros and cons.

IDENTITY columns

The default way to identify objects in SQL Server is to use an INT or BIGINT column marked as an IDENTITY. This guarantees relatively sequential numbers, barring restarts and failed inserts. Using identity columns put the responsibility for creating and maintaining object identity in the database.

SQL Server will cache IDENTITY values and generate a new batch of identity values whenever it runs out. Because identity values are cached in memory, using identity values can lead to jumps in the sequence after SQL Server is restarted. Since identities are cached in memory in large batches, they make it possible to rapidly insert data – as long as disks are fast enough.

Sequences

Sometimes the application needs more control over identity. SQL Server 2012 added sequences. A sequence, unlike an identity value, is a separate object in the database. Both application and database code can read from the sequence – multiple tables can share a sequence for an identity column or separate sequences can be created for each table.

Developers using a sequence can use the CACHE value to cache a specific number of sequence values in memory. Or, if the application should have minimal gaps in the sequence, the NOCACHE clause should be used.

The Problem with Sequential Identities

Both IDENTITY and SEQUENCE values keep identity generation squarely in the database and, by using integral values, they keep the value narrow.

You can run into problems with sequential inserts on very busy systems – this can lead to latch contention on the trailing pages of the clustered index. This issue can be resolve by spreading inserts across the table by using a GUID or some other semi-random clustering key. Admittedly, most systems are never going to run into this problem.

GUIDs for Object Identity

Some developers use GUIDs as a way of managing object identity. Although database administrators balk at this, there are good reasons to use GUIDs for object identity.

GUIDs let the application generate object identity. By moving object identity out to the application layer, users can do work in memory and avoid multiple round trips to the database until they’re ready to save the entire set of data. This technique gives tremendous flexibility to application developers and users.

There’s one other thing that a well designed application gets from this technique – independence from the database. An application that generates its own identity values doesn’t need the database to be online 24/7; as long as some other system is available to accept writes in lie of the database, the application still function.

Using GUIDs for object identity does have some issues. For starters, GUIDs are much wider than other integral data types – 16 bytes vs 4 bytes (INT) or 8 bytes (BIGINT). This is a non-issue for a single row or even for a small database, but at significant scale this can add a lot of data to the database. The other issue is that many techniques for generating sequential GUIDs in the application (see NHibernate’s GuidCombGenerator) can still run into GUID collisions.

Integral Generators

What if you could get the best of both worlds? Applications generating unique identities that are also sequential?

The point of identity generation is to abstract away some portion identity from data attributes and provide an independent surrogate value. GUIDs can provide this, but they aren’t the perfect solution. Identity generators like flake or rustflakes promise roughly sequential identity values that are generated in the application layer and are unique across multiple processes or servers.

The problem with an external identity generator is that it is an extra piece of code that developers need to manage. External dependencies carry some risk, but these are relatively safe items that require very little effort implement and maintain.

The Solution

There’s no right solution, there’s only a solution that works for you. You may even use each solution at different points in the lifecycle of the same product. It’s important, though, for developers and DBAs to be aware of how identity is currently being handled, the issues that can arise from the current solution, and ideas of how to handle it going forward.

#SQLPASS Speakers: Here’s How to Improve Your Session

Developers: You Don’t Need a DBA. I’ll Prove It.

21 Comments. Leave new

Wyatt
August 14, 2014 9:07 am

I maintain an application that uses GUID PKs. I used to be a big fan, but as I’ve had to scale this application, I regret their use. The biggest table in my database is mostly just a bunch of FK columns. Because of the Guids, it takes up almost 4 times the space it would really need to. Also, sequential guid is an absolute must for the PK in this use case (since there no good available natural clustered key). Records tend to be inserted in batches of 10ish or so, and later looked up by the same FK. With a sequential guid on the clustered key, they will tend to be in the same leaf page. With random guids, they’re almost guaranteed to be on different leaf pages. Thankfully, gobs of RAM can make up for a lot of design sins.

Reply
Justin
August 14, 2014 9:29 am

Excellent. I often find myself looking at new concepts and the options within and asking myself “what’s the difference?” and then having a hard time finding succinct rundowns such as this one.

Easy to read, important to understand – no code to wade through….

Keep these coming.

Reply
- Jeremiah Peschka
  August 14, 2014 11:08 am
  
  Thanks for the kind words!
  
  I’m happy to keep providing no code to wade through – it’s as hard to write as it is to read.
  
  Reply
dave wentzel
August 14, 2014 11:02 am

Great article and should be a mandatory reading for people using ORMs that generate GUIDs for IDs. A couple points: If you insist on allowing your ORM to create the GUIDs for you to keep everything loosely coupled, then you probably shouldn’t use those GUID columns as clustered index keys.

I think the best solution in most cases is to use sequences in the database that are cached in the app. When the application exhausts its allocated IDs it requests a new block from the database. Then ORMs can do everything in memory without using GUIDs or making unnecessary db calls for ID management. I would like to see this become a standard feature of ORMs. Sequences are a feature of most RDBMSs and when they are missing (MySQL) they can be easily simulated.

Reply
John Reed
August 14, 2014 4:45 pm

So ran into this:

The primary key was a GUID and the table had a nvarchar(max) on it, which often got filled with rows averaging 1,702, and maxing out at, 4,697 charters in length (so big). The application tier determined the GUID (so random) and the table had several thousand inserts throughout the day…. oh and the table was big… if memory serves approaching 100 gb. This table also linked to several child tables using the guid as the foreign key.

Dev complained that inserts could get slow throughout the day. To improve performance, reduce cpu reads etc, I added a column that was an identity int and made it the primary key. Then added a Unique Non-Clustered index to the GUID. This insured all new inserts would get put into the bottom of the table, and with the unique index allowed me to keep the relationships in-place and unchanged.

Reply
- Justin
  August 14, 2014 4:51 pm
  
  @John, how much did this change impact the inserts’ performance in your case?
  
  Reply
  - John
    August 14, 2014 5:28 pm
    
    It only became a performance issue during high load. That being said, we had a trace that captured anything that ran longer than 2 sec, and until this change the insert and several other queries that hit the table appeared in trace on a pretty regular basis. My guess is it was because the GUID was random and each row was pretty big so sql had to spend a good bunch of time figuring out where to put it, and locked the table while figuring it out. With the new structure it just needed to validate the GUID hadn’t been used and added the row to the last “page” of the table.
    
    That being said… the change was made a while ago and I don’t have access to the “before and after” anymore.
    
    Reply
    - Kendra Little
      August 14, 2014 6:41 pm
      
      I would wonder what the wait stats were, etc — just because I’ve had quite different experiences. Thousands of inserts a day isn’t too bad.
      
      Back when I was first starting as a DBA, I managed a system that was entirely designed on GUID keys. Someone got the idea that it was all wrong, and a major investment was put into a redesign to cluster on IDs. The GUIDs were kept as nonclustered PKs. We had terabytes of data so this change took quite a while to deploy and manage.
      
      And after the change was done, performance was exactly the same. Months of work and nobody could measure a difference.
      
      I do think that there are some cases out there where it can make a difference, but like Jeremiah mentions, there are cases where you can get latch contention on an ascending int/bigint key too! But since I had that experience where it really wasn’t worth all the weekend hours we put in, I default to skeptic-mode on this topic 🙂
      
      Reply
      - IL
        August 15, 2014 3:08 am
        
        Thanks a lot, this article is just in time for we are in the move from ID database design to better distributed GUID design. Many data changes in databases will be distributed among all other databases while they are still separate, further all databases might merge into one. We need a GUID for that, but is a whole lot of work rewriting queries and applications. Keeping ID as IDENTITY clustered PK (technical PK) and making GUID unique non-clustered index for foreign relationship (true PK) might be a better idea.
        
        Though all joins and inserts into child tables instead of
        declare @id int
        insert A…
        set id=@@IDENTITY
        insert B(idA,…) select @id
        
        need to be reworked like this:
        declare @guid uniqueidentifier
        insert A…
        set @guid = (select uid from A where id=@@IDENTITY)
        insert B(uidA,…) select @guid
Mike Sofen
August 17, 2014 8:32 am

Excellent article, Jeremiah – it got me to reconsider (yet again) everything I “know” about row identification. My thoughts:

Modern hardware is so powerful that it tends mask many performance issues, so both of the issues mentioned (latching of IDs, performance of guids) are usually invisible.

What continues to bother me most about guids is their visual “unreadability”. During unit testing, engineers and analysts often have to spend hours/days/weeks pouring over the raw data, trying to ensure things are good. Guids make this process very fatiguing, especially since it’s a rare table that has just one (as a PK), and in unit test, we need to manually examine each of child rows (via foreign keys that are now guids).

The other issue I have with guids is storage and performance in very large/high speed transactional systems…which are typically the type of system where you’d want to leverage them (in a small, low tran rate system, the hardware will normally mask everything). In a semi-denormalized high speed OLTP system I might have 5-20 foreign keys in a table…visualize the impact of guids not only on writes at a million rows a day, but joining up all those child tables to serve up the data back to a UI.

In a guid world where the guid values are being distributed across many index pages, we end up with more index fragmentation more quickly, so defragging has to run very often.

Since a latch contention will only occur on an insert and modern SANs have enormous caches which buffer writes, that may explain why I haven’t seen that particular issue and why I continue to use the simple int/bigint identity on my tables.

When faced with the need to maintain global uniqueness across multiple servers/databases, the issue becomes more cloudy very quickly. I’ll leave that for another note.

Reply
- Jeremiah Peschka
  August 17, 2014 9:04 am
  
  Hi Mike, here’s where I respectfully disagree with everything you just said.
  
  Index fragmentation isn’t all that it’s cracked up to be. In the Oracle world, people have been saying “Why are you defragmenting? What problem are you actually solving?” since around 2000/2001. The SQL Server world is finally starting to slowly let go of the “defragmentation is good” mantra. The only thing that defragmentation really does is burn more cycles while you shuffle data around in order for it to fragment again.
  
  Like Kendra, I’ve been a part of projects to remove GUIDs from systems with a large transaction volume where we were sure that the GUIDs were a problem. After many long, painful migrations where GUIDs were slowly migrated to alternate, non-clustering, keys and finally were removed, we found that there was no measurable performance difference on the system.
  
  Your SAN cache, unless you’re a lucky duck, is shared among many many systems. And, even so, the SAN cache won’t be a factor in latch contention. Latching (or OS level locking) happens on the SQL Server machine and occurs as in-memory pages become a contention point. The SAN doesn’t factor into latch contention.
  
  Honestly, I’ve never seen GUIDs be a problem in a system apart from the size they take up. That can have far reaching consequences, but even those consequences can be mitigated through hardware and a careful approach to querying and design.
  
  P.S. 1 million inserts a day is 11.5 writes a second. That’s a laptop grade hard drive.
  
  Reply
  - Mike Sofen
    August 18, 2014 9:22 pm
    
    I used the million rows a day (into a single table out dozens) as a contrast to the thousands of rows a day example of a prior poster. Yes, the average row count per second is nominal – I’ve built systems doing thousands/sec – that wasn’t the point. And our SAN is somewhat burly, with a cache of 1.5TB on 12 nodes, so handles quite a bit of simultaneous stuff, and our standard servers are 40 core/512 gb ram (thus my latching comment) but that wasn’t the point either.
    
    My main points were that modern hardware (of all types) tends to mask a lot of design/coding issues, and it is fatiguing to unit test data with guids (for everyone). That by itself isn’t a deal killer, but when combined with no apparent performance gain or loss, then…what’s the point of using guids, unless a system requires globally unique identifiers…many enterprise systems don’t.
    
    In the OLTP systems I’ve had to work with, index fragmentation has been a major, measurable performance factor, so I’m surprised by your comment. My current project involves the real time integration of medical data from a hospital EMR and a clinical EMR, with total source systems storage over 11TB and many billions of rows.
    
    I’m leveraging SQL 2012 CDC in subject matter ETL pipelines where I time (and log) each loop duration for each pipeline (typically between 50 and 600 milliseconds). I can peg the relationship between loop execution times and fragmentation, and after defragging, loop times go back to baseline. Since it’s a new system, I can’t yet say where the sweet spot will be, but at the moment, weekly defragging is close enough.
    
    Since it’s an ODS, it has to both handle the real time integration/storage AND serve up data for new point of care apps…without blocking or delays. I can’t imagine doing that without defragging…unless I’ve been asleep and something new has replaced defragging.
    
    Reply
    - Dave Wentzel
      August 19, 2014 7:34 am
      
      Hey Mike, I worked for many years at an EMR vendor and I agree that your observation of loop execution times are correlated to fragmentation…but are you sure that implies causation? You really need to know what the wait stats are. Otherwise the cause could be something else. As Jeremiah mentioned, Oracle guys don’t care about fragmentation other than as an indicator that their index fillfactor may be wrong (called pctfree in Oracle). Once you fix the fragmentation it will likely begin to creep back in if your index keys are subject to updates or you do lots of deletes. Hence why you are defragging weekly.
      
      The only time I think it is prudent to point the finger squarely at fragmentation is when you know the index is normally scanned and you see the index involved in a lot of PAGEIO_LATCH waits.
      
      But I would be interested in knowing if that is a valid metric.
      
      Reply
      - Mike Sofen
        August 19, 2014 9:48 am
        
        Reasonable question re causation…I didn’t measure waits before defragging, so I can’t prove anything per se. However, I did see frag levels from 40-90%, and with no other changes to source systems, servers, data volumes, code, or target server configs, the defragging reduced loop times by 80%. To me, that IS causation.
        
        This is still in dev, so yes, I’m still doing a lot of batch deletes as I reload slices of data, and…doesn’t that actually support the notion that it was defragging as the root cause versus something else? This is where simply dismissing a possibility because “we’ve found other things to blame” isn’t a prudent path for my brain to follow.
        
        In the last few days since the last defrag, the frag levels on my busiest table (lab results) are back to 30-60%, with no physical deletes occuring (typical), and the 60% is an index on a child table key column that gets updated. Loop performance however, is still within the norm, supporting the alternate notion that fragmentation doesn’t necessarily cause performance degradation, to which I’d always agree.
        
        My experience also says that there is always a tipping point for a given table where the fragmentation impacts the performance. I tested that theory one day by setting the fill factor to 5% (yes indeed) as a simulation of fragmentation and watched the page counts explode and the performance crash to the basement. Setting it back to 95% reversed the behavior – that’s cause and effect, yes?
        
        Unfortunately at work, I’m unable to view the video suggested by Jeremiah, so I’ll do that at home and see if I get some new insights. I do appreciate the comments – this has been an excellent conversation!
      - Kendra Little
        August 19, 2014 10:03 am
        
        Your example with the 5% fillfactor is great! Let’s dig into that one a bit.
        
        5% fillfactor makes the page count go way up on a large table– and page count DOES matter. It’s a factor in what execution plans are generated! So rebuilding and lowering the fillfactor can generate different execution plans due to both the explosion in page count, as well as by triggering recompiles. The increase in pages to be scanned drives up CPU also.
        
        But thinking about 5% fillfactor more, it’s not fragmented in the sense of causing random IO, right? The pages are all in order, they’re just largely empty. That’s not really the same fragmentation you get from inserts causing page splits where the pages are physically out of order, but typically much more full than 5%. (Unless you never do index maintenance.)
        
        It’s easy to see fragmentation as a huge evil, because rebuilding indexes can trigger better execution plans. But often the problem was bad plans rather than the fragmentation. No matter what your clustering keys are, nonclustered indexes are always going to be prone to fragmentation, and the issue of “empty space” does mean that usually at least some periodic index maintenance is needed.
    - Jeremiah Peschka
      August 19, 2014 8:23 am
      
      Like Dave, I’m intrigued by how you’re correlating index fragmentation as the underlying problem.
      
      You might find Kendra’s video Why Index Fragmentation and Bad Statistics Aren’t Always the Problem (Video) interesting as it definitely looks like it’s relevant here.
      
      Reply
Martijn Evers
August 25, 2014 9:31 am

I’m missing bulk and batchloading discussions. In largely batch oriented systems, do GUIDS provide any benefit apart from cross system uniqueness?

The same goes for wide tables with 10 to 50! FK’s per row.

Personally I’m all for cached and controllable sequences, where I can get say a block of 1000000 incremental sequence values and apply them in bulk.

Reply
- Jeremiah Peschka
  August 25, 2014 9:40 am
  
  The more foreign keys you have, the more cognizant you need to be of the data type of the foreign key and the side effects that will have on the system. Wide keys will mean that related tables will be wider and there will be a bigger storage cost.
  
  Getting sequence values in bulk relies heavily on your methodology. I can’t provide anything approaching accurate advice in a comment like this. It could be as simple as creating an API to some library and calling the new sequence function. Or you could use SQL Server 2012 or newer and use a SEQUENCE and grab new values from the sequence after calling it 1,000,000 times.
  
  Reply
Aaron
June 10, 2015 2:26 am

Jeremiah,

I have a project in which we are capturing user inputs for different open groups we have created. We are planning for a peak of 2000 inserts per minute, in which 2000 is the number of open groups at a time, while simultaneously getting a peak of 5000 selects on the inserted data. The selects are being done on the group id of the data and each group is being estimated to max out at about 100 items each. There will also be another possible 10-15 thousand requests coming in at the same time (~80% read, 20% write) that will be accessing different data.

I’ve done some testing on the insert performance using bigint, int, uniqueidentifier and sequential uniqueidentifier as the primary key for the table. I don’t see much difference in the amount of time it takes for either one loading 5 million rows, it takes just over 4 mins for the fastest to 4 mins and 14 seconds for the slowest. And this is on the rusty spinning frisbees of a development machine and should be much faster in production which has SSDs for the storage.

With the amount of data potentially coming in at once I’m trying to decide if using a Guid for the primary key is the better choice to avoid the possible latch contentions that could arise with an integer and take the hit on it consuming more storage or if an integer would work just fine.

You made a mention in the initial post about latch contention only being a possible issue on a really busy system. Knowing that mileage may vary, are there any guidelines you would recommend to help in making that decision?

Reply
- Jeremiah Peschka
  June 10, 2015 7:23 am
  
  The only advice I have is to perform your own testing at a level that exceeds your wildest dreams of success and see if you experience the latch contention. I’ve only come across it once in 4+ years of consulting; that might tell you something about how common this problem is.
  
  Reply