Use email address as primary key?

sql database database-design postgresql

48,480

Solution 1

String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.

If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.

Solution 2

I will also point out that email is a bad choice to make a unique field, there are people and even small businesses that share an email address. And like phone numbers, emails can get re-used. [email protected] can easily belong to John Smith one year and Julia Smith two years later.

Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you will have to update the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)

Solution 3

the primary key should be unique and constant

email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.

Solution 4

Disadvantages of using an email address as a primary key:

Slower when doing joins.
Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)
An email address could change, which forces all records using this as a foreign key to be updated. As email address don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.

Advantages of using email address as a primary key:

You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.
When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.
You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.

In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.

Solution 5

I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a columns might be case sensitive. PostgreSQL docs say: „If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive“. In other words, if you accept user input for a search in a table with email as primary key, and the user provides "[email protected]", you won't find “[email protected]".

View more solutions

48,480

robert

Hello World!

Updated on December 21, 2020

Comments

robert over 3 years

Is email address a bad candidate for primary when compared to auto incrementing numbers?

Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. However my colleague suggests that string comparison will be slower than integer comparison.

Is it a valid reason to not use email as primary key?

We are using PostgreSQL.
- onedaywhen over 13 years
  
  What do you mean by 'primary'? If the email address needs to be unique then it is a key and requires a unique constraint. Whether you decide to 'promote' it be being 'primary' is arbitrary, unless there is a practical reason for doing so e.g. optimizing a poorly performing system.
- James Westgate over 13 years
  
  If you want your database to enforce a unique email address, then create a column with a unique index, but dont use it as the primary key.
- systempuntoout over 13 years
  
  @robert What if someone wants to change his email address? Are you going to change all the foreign keys too?
- onedaywhen over 13 years
  
  @James Westgate: what is the distinction between "unique index" and "primary key" that you are making? Is it the same distinction that the OP is making? For me, the difference between a unique constraint and a primary key are so subtle they aren't worth bothering with.
- James Westgate over 13 years
  
  @onedaywhen - hardly any difference, but the primary key will be clustered by default, whereas a unique index wont be. You will still want to define the primary key which will be the default single record lookup key, the unique index merely enforces the uniqueness of the column over a normal index
- MikeD over 13 years
  
  a primary key is usualy used to be referenced by child entities. there may also be exports of child data to secondary systems/DB's. As an email tends to change over time this change needs to be cascaded. As a primary key I would use something less likely to change than an email address
- Matthew Wood over 13 years
  
  @James Westgate: FYI, there's no such thing as automatic clustering in PostgreSQL. A PRIMARY KEY is implemented on disk exactly the same as a UNIQUE INDEX where all the fields are NOT NULL.
- James Westgate over 13 years
  
  Hmm ok. Dont know about postgres. Must have been added to the question. Certainly in MSSQL there is a physical order.
- onedaywhen over 13 years
  
  @James Westgate: so back to you original comment: "create a column with a unique index, but dont use it as the primary key" -- do you now agree that there is "hardly any" basis for recommending this? Even in SQL Server land, where you can explicitly define a table's clustered index (and is perhaps better done that way), it makes hardly any difference.
- James Westgate over 13 years
  
  No not at all. In 'Sql Server Land' you wouldnt cluster an index on a key you are not using eg email. In Sql Server land you would have your primary ie clustered index on an integer value. You would however enforce the uniqueness of the email address by using a unique index in addition to the primary key. As I said I know nothing about postgres. Hope that clears that one up for you.
- onedaywhen over 13 years
  
  @James Westgate: In 'Sql Server Land you can achieve all you said without using PRIMARY KEY. So what's the justification for using PRIMARY KEY at all?
- James Westgate over 13 years
  
  Primary key has to be target of a foreign key.
- dburges almost 11 years
  
  @JamesWestgate, not true, a foreign key can be linked to a PK or a field with an existing unique constraint in SQL Server.
- James Westgate almost 11 years
  
  That is somethign I didnt know - thanks HLGEM
Sjoerd over 13 years

Why is it "better"? Any reasons or sources?
Sjoerd over 13 years

Can you elaborate on that?
sleske over 13 years

+1 for mentioning the cascading update problem. That's why friends let friends only use surrogate keys ;-).
dburges over 13 years

@sleke I love the phrase:"That's why friends let friends only use surrogate keys
Unreason over 13 years

ah, I don't like the saying at all... surrogate keys can also be the source of problems; yes, the application will be more robust to change of business and/or integrity rules, however the information can get lost a bit easier and the identity of records becomes less clear. so I would not recommend a rule of a thumb here...
dburges over 13 years

I have never in 30 years of database work lost records due to a surrogate key.
onedaywhen over 13 years

The OP states very clearly: "Our web application needs the email address to be unique in the system" so your observation about "people and... businesses that share an email address" do not apply in this case.
onedaywhen over 13 years

"Always give due consideration before you implement requirement x just because your application needs requirement x." -- the worst piece of advice I've read in quite some time.
Jay over 13 years

How could people share an email address? If email addresses are not unique, the whole email system breaks down. Unless you mean that, say, a husband and wife might share a single email account. But in that case, are they really two different customers? Now it depends on what your definition of a "customer" is. I've used email addresses as primary keys and never had a problem with that. Maybe in some cases it would be an issue. Have to look at the context.
onedaywhen over 13 years

A property of a good key is that is should be stable but NOT necessarily immutable.
jrharshath over 13 years

I'm not convinced by your "argument" -- in real life there will often be situations when some essential data (e.g, a phone number) will not be available immediately. If such a field is marked as NOT NULL in a database, it will require the users to pollute the data with dummy fields (like 123) instead of leaving it empty. It would be more practical to let the application handle the constraints (and in this case, the app could flag an empty field as a action item).
Jay over 13 years

How would changing email addresses cause there to be duplicates? Unless user A changes his email address, and then user B changes his email to be the same as user A's old value, and your updates are not done in sequence. Remotely possible, I guess.
Jay over 13 years

I agree that defining a field "not null" should be done cautiously. Requirements like "we always need the customer's phone number" should be considered carefully. Might it not be desirable at times to create a customer record even though we don't know the phone number right now, and go back and get it later? But "this field must be unique" is a different category. I can't imagine saying "It's okay for two employees to have the same social security number, we'll figure it out later." How would you ever straighten out the data?
dburges over 13 years

@onedaywhen and @jay, just because you think it shoud be unique doen't make it unique. And yes a husband and wife might be different customers. Just becasue you haven't run into this before doesn't mean it won't happen. I have run into it and it does happen which is why email should never be allowed to be considered unique whether you think it should be or not. This is the kind of requirement you push back because it is inherently wrong.
meriton over 13 years

A foreign key reference, by definition, contains the value of the primary key of the row it refers to. Put differently, it duplicates the value of the primary key. (So the duplicating is not caused by changing the value. But changing is harder due to this duplication, and the constraint enforcing it).
Jay over 13 years

@HLGEM: I don't want to get into an endless argument, but you can't say that a proposed key is not unique based on hypotheticals without knowing the context. e.g. from the phone company's point of view, a telephone number uniquely identifies a customer, by definition. Yes, you can say, "But what if there are two or three people who might answer when you call that number?" But this is irrelevant. From the phone company's point of view, by definition this is one customer. (continued ...)
Jay over 13 years

(continued) Likewise, if you are building a system that is largely concerned with email communications -- perhaps a message dispatching system, or a notification forwarding system -- then it is likely that by definition, an email address uniquely identifies a user. If multiple people share that email address, that is irrelevant. They are a single message destination, therefore, they are a single user. "User" and "customer" do not have to be synonyms for "individual human being".
Bill Karwin over 13 years

@onedaywhen: Yep! Otherwise why would SQL support cascading updates?
Steven A. Lowe over 13 years

if you have a choice, go for constant/immutable keys; less work for you down the road; just because SQL supports cascading updates doesn't mean it's always a good idea!
Aaronaught over 13 years

More likely, the e-mail address is going to be used as a login ID. That might inconvenience 1% of the customers but the other 99% will have a much happier experience relative to having to choose and remember a unique user name. The only other alternative is OpenID which most laypeople don't have.
Matthew Wood over 13 years

@Conrad: Although, he does point out that it's not a PITA if you have a engine that supports ON UPDATE CASCADE. It's a non-issue at that point code-wise; the only real issue is how extensive is the update and how wide is the key. Email address may be a bit much, but a CASCADE UPDATE for a PK of 2-character country code isn't a big deal.
Micah over 13 years

Unless you care little to nothing about performance then use a GUID. It's no-no #1 if you are building a system that will need to scale
Tim Rourke over 13 years

"change like the seasons"... I just got done with a project where the users' e-mails changed frequently, and without any notice to application managers or developers.
Conrad Frix over 13 years

@Matthew IMHO its still a PITA. For example assume that when you designed your country table there were only two tables that referenced it, no biggy, But over time it became 20 tables each with hundreds of thousands of records. Some with the reference some without. This makes a single logic write end up being tens of thousands of writes, and it doesn't make it to all the tables because someone forgot a reference when the added the table. This is exact thing happened to me on a 2 char country code table I kid you not.
user1601201 over 13 years

no... see davybrion.com/blog/2009/05/…
Jay over 13 years

@Wood & Conrad: The worst case is when there's no built-in DB support. Then you have to write code for it for every table with a posted reference, and this is just a pain and a door for bugs to slip in. With the cascades, you just have to remember to add one clause on each table, not such a big deal.
Ash over 13 years

Advantage 1 and 3 are premature optimizations, advantage 2 is a very minor benefit and is completely overcome by any decent query tool.
onedaywhen over 13 years

Many websites I frequent require that I supply a unique email address to identify me: Amazon, Google, Microsoft, etc. Please advise how I can "push back" this requirement to them.
onedaywhen over 13 years

... though I'm not sure I want to: I assume they, as I do, like the idea that an email address is verifiable (they require that I respond to an email they send) and familiar to users (i.e. I can remember my own email address!), both of which are properties of a good key.
Vincent Malgrat over 13 years

+1: relying on cascading updates is bad practice (it brakes db normalization)
onedaywhen over 13 years

@Vincent Malgrat: "cascading updates... brakes db normalization" -- methinks you have misunderstood the concept of normalization!
Vincent Malgrat over 13 years

@onedaywhen: in a normalized db, you should not have the same information repeated on multiple rows. An email address as primary key will be repeated on all foreign keys.
Jay over 13 years

@Ash: Thee's a difference between "optimizatin" and "premature optimization". But okay, by the same reasoning, all the disadvantages I've seen anyone mention are premature optimizations. So where does that leave you? As to #2, I find typing in extra joins when trying to do ad hoc queries to be a major pain. Records often have multiple foreign keys so you may need several joins to get to comprehensible data. If by "decent query tool" you mean one that figures out what data you want to see without you telling it and magically does the joins for you, I'd like to see how that works.
Gary Chambers over 13 years

Said in true Microsoft-Kool-Aid-drinking fashion!
onedaywhen over 13 years

@Vincent Malgrat: thanks for confirming that you have indeed misunderstood the concept of normalization. "you should not have the same information repeated on multiple rows" -- did you really mean to say "information"?! A compound key will usually involve values repeated on multiple rows. For a foreign key, values are referenced rather than "repeated", big difference. A single-column domain with two values (e.g. 'Yes' and 'No') will have the same values on multiple rows in a referencing table if it has three or more rows. This is really basic stuff!
telent over 13 years

Worth mentioning in this connection that [email protected] and [email protected] may be the same mailbox or may be different mailboxes and you have no way of telling - there's nothing in the spec to say whether the local-part is case-sensitive.
Reddy over 13 years

+1 for the line "Assume some e-mail provider goes out of business."
David Thornley over 13 years

Be Wolves: I knew a woman once who didn't have her own phone number. What do you do then?
Admin over 13 years

This is more a general issue with the uniqueness enforcement of email addresses rather than whether they should be used as primary keys - the same issue is there either way. +1 because it is still a very useful point
highwingers about 11 years

also to add...people male TYPO errors when creating accounts on line...this will also make nightmare for you to change email address all over the places.
Rafa about 11 years

This is not a problem. Foreign-key cascading exists to solve this issue. If a user changes their email, the change will cascade to all the tables using it as foreign key.
a_horse_with_no_name about 11 years

No, inserts it will be slower, because you need two unique indexes: one on the generated primary key and another one on the email address.
dburges almost 11 years

@rafa, I assure you that if you use cascading updates and a whole provider goes out of business or changes their name (Yahoo.com becomes HooYa.com), your database will be locked to all users for hours and maybe days while this cascaded through the system. It is a very valid problem (and a reason why it is a poor idea to use cascading updates if you have any significant amount of data and the key is likely to change.)
Rafa almost 11 years

@HLGEM as with everything, it depends on the size. If you're facebook, it can go very wrong. If you only have a few thousand users, it's probably "OK"'ish... It also depends how many other tables you have. And one should also consider how often big mail providers disappear. Small providers may come&go, but they're small, so you may not have a million rows with their emails. I don't use emails as PK myself, though, but not because of fear of mail providers going away.
dburges almost 11 years

I'm senstitve to the issue because our clients are in an industry where there are frequent mergers and we have large numbers of emails change very frequently. Not designing for things that can normally be expected to happen in the life of the data is irresponsible. You have to consider how the data will be maintained over time when making database design decisions. This not premature optimization, this is responsible design.
Stefan Steiger over 9 years

@Sjoerd: The issue is not that the email-address is stored multiple times, although that's definitely inefficient, but who cares about hard drive space today. Most businesses don't have google-scale, where this would matter. The issue is that the email address cannot be changed afterwards, because it's both a primary key & referenced as foreign key.
Jonathan Allen over 9 years

@StefanSteiger Who said anything about hard drive space? Anything you store is going to take up space in RAM.
Jonathan Allen over 9 years

SQL supports cascading updates because sometimes people break the rules and the folks who clean up after them need work-arounds.
Jay over 8 years

Very late follow-up: I'm hard pressed to think of the last time that all x.com email addresses got changed to y.com. How would they do that? If, say, Google were to buy out Yahoo, they couldn't just declare that all Yahoo email addresses will have the same value before the "@" but after the "@" will now be "gmail.com", because that would create duplicates. And in any case they'd probably just pay the small fee to keep the old domain name. Yes, small providers go out of business and just disappear. But either way, changes are going to come in one by one, not as some huge bulk change. ...
Jay over 8 years

... If you have thousands or millions of records using any given email address as a foreign key, i.e. many many child records for each parent record, or worse, many many records spread across many tables, a cascade may take an unacceptable amount of time. That is certainly a factor to consider when designing a DB. I think that would be a relatively rare case, though.
tofutim over 8 years

In case any one wonders, as I did, a GUID key would be equivalent to an email key I think.
malhal almost 8 years

Thought I'd point out that now we have cascade this isn't an issue
Philip Schiff about 7 years

@DavidThornley Sounds like you should work out more, or perhaps adapt a friendlier demeanor.
MADforFUNandHappy over 3 years

@StefanSteiger thanks for that information, so a uuid saved as a varbinary primary key would be better right ?
Stefan Steiger over 3 years

@MADforFUNandHappy: Yes, uuid would be better. But you can use whatever you want as primary key (auto-incrementing int/bigint/int128, uuid, varbinary). It's just that if you put data into the primary-key, the data becomes unchangable, because you'll have foreign keys referencing the primary key (aka the data). Once that happens, you cannot change your data anymore (e.g. the email address).
Stefan Steiger over 3 years

@Jonathan Allen: Yes, of course, there's also RAM. But unless you have google-scale, or something else quite large, that won't matter much as well. But it's inefficient, yes.
Jonathan Allen over 3 years

@StefanSteiger Compare the size of an int to an email address. Then think about how many indexes you have that are using that as a key times the number of rows in each index. I may not have Google scale data, but I don't have Google scale hardware either.
Stefan Steiger over 3 years

@Jonathan Allen: Hmmm, true, that bloats the db times N by foreign-key indices. Definitely not a good idea.
MADforFUNandHappy over 3 years

@StefanSteiger What do you mean you can not change the email anymore, only the primary key can not be changed anymore anything else in the user table e.g. username,email,password,firstname,lastname can of course be changed. UPDATE: Oh okay I misread your post, you mean if the email address is used as the PK then you can not change it later right ? Now it makes sense, although I would never use the email address as the PK.
Stefan Steiger over 3 years

@MADforFUNandHappy: Glad you got it. Technically, you could set on update cascade on the foreign key to solve that, like dba.stackexchange.com/questions/84434/… . But when the DB-schema is already botched, you probably don't want to drop and re-create all foreign-keys.