Unique Identifiers – Part 1, Identifying the Generator

Is a UUID really unique? This post seems to think there are problems. http://blog.joeware.net/2005/06/19/42/

There are several types of UUID’s, the most common being the (supposedly) randomly generated kind and the kind based on time and MAC addresses. There are problems with each of these.
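To make the two kinds concrete, here is a small sketch using Python's standard uuid module. The node value is a made-up placeholder standing in for a MAC address, and it is passed in explicitly only so the example gives the same node on every machine.

```python
import uuid

# Hypothetical 48-bit node value standing in for a network card's MAC address.
EXAMPLE_NODE = 0x001122334455

# Version 1: a timestamp combined with a node value (normally the MAC address).
time_based = uuid.uuid1(node=EXAMPLE_NODE)

# Version 4: built from random bits, with no structure you can look anything up by.
random_based = uuid.uuid4()

print(time_based.version)      # 1
print(random_based.version)    # 4
print(hex(time_based.node))    # 0x1122334455, the node we passed in
```

The version-1 value carries the node in its last twelve hex digits, which is exactly the part this post worries about.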

The above link has good opinions on the fallacy of randomly generated identifiers. Besides bad implementations, there is also a great reason for not using them: they are so damn unique that there is no way to look them up unless you have a database of all that are in use. You can’t specify ranges that would let you divide the database into segments.

MAC addresses have nothing to do with Apple Macintosh computers, although those computers have MAC addresses in them. A MAC (Media Access Control) address is a unique identifier burned into read-only memory on your computer’s network card. Ranges of MAC addresses are doled out by a central authority (the IEEE) to network card manufacturers.
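As a rough illustration of how those ranges work: the first three bytes of a MAC address are the prefix assigned to the manufacturer, and the last three are chosen by the manufacturer. The function name and the sample addresses below are made up for the example. The sketch also reads the "locally administered" bit, which is set when an address has been assigned by hand rather than taken from a manufacturer's range.

```python
def parse_mac(mac: str) -> dict:
    """Split a MAC address into its manufacturer prefix and device part."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return {
        # First three bytes: the range handed out to the manufacturer.
        "prefix": ":".join(f"{o:02x}" for o in octets[:3]),
        # Last three bytes: picked by the manufacturer for each card.
        "device": ":".join(f"{o:02x}" for o in octets[3:]),
        # Second-lowest bit of the first byte marks a locally assigned address.
        "locally_administered": bool(octets[0] & 0x02),
    }

print(parse_mac("00:1A:2B:3C:4D:5E"))
print(parse_mac("02-00-00-aa-bb-cc")["locally_administered"])  # True
```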

There is something obviously wrong with that scheme in that we’re expecting the network card manufacturers to police themselves to ensure that addresses aren’t duplicated. When you pay sixteen dollars for a piece of hardware, made in the poorest manufacturing environment in the world, do you really feel confident that the manufacturer has paid its dues and complied with this directive, which isn’t even backed by law? OK, but just be sure you take that painted wooden play block out of the baby’s mouth.

If that wasn’t enough, there’s a real possibility that someone has set up duplicate MAC addresses on their LAN intentionally. Some ISP’s (Internet Service Providers) and wireless configurations use MAC addresses to allow access to the next network over. Of course, you and I wouldn’t come up with a scheme like that since it relies only on the obscurity of the address, and not a shared secret or hashed password. But that’s the nature of the market.

So what happens is that the customer registers his MAC address with the ISP, and is now allowed into the network as long as he’s using that ethernet card. Later on, the customer trades in his computer, or replaces the ethernet card that’s gone bum, or sticks a router in between his computer and the internet. Whatever he’s done, he needs to deal with his MAC address: either register the new one with the ISP, or change the new hardware back to the old MAC address. Given the choice, he will probably not want to spend quality time listening to muzak while waiting for customer service to answer the phone and then find someone at the ISP who has a clue.

So, especially with the “insert a router in the mix” scenario, there’s a good possibility that there is a duplicate MAC address on the LAN. Intentionally done, with good intentions. The intentions don’t even have to be good, as one could use this scheme to leech access from the ISP from multiple locations.

Finally, MAC addresses are a lousy way to generate unique ID’s that represent real-world data. That’s because in order to look up an unknown piece of information, you need a map of MAC addresses to computers. When collaborating with others outside your LAN, you’d have to remap your UUID’s to some central MAC address.

You also have a problem within one computer itself. You may want multiple programs to be able to generate unique ID’s, but there is only one MAC address per computer. So, you would have to have some locking mechanism so that more than one process can’t have access to the ID generator at a time. This scenario is even specified in the UUID specification document, www.opengroup.org/dce/info/draft-leachuuids-guids-01.txt
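To give a feel for what such a lock involves, here is a minimal sketch of a shared counter guarded by an advisory file lock. It is not the mechanism from the specification, just one plausible way to serialize access on a POSIX system (fcntl is not available on Windows). The function name and the counter file are hypothetical.

```python
import fcntl
import os
import tempfile

def next_serial(counter_path: str) -> int:
    """Return the next number from a counter file shared by several processes."""
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as counter:
        # Blocks until no other process holds the exclusive lock.
        fcntl.flock(counter, fcntl.LOCK_EX)
        try:
            text = counter.read().strip()
            serial = int(text) + 1 if text else 1
            # Rewrite the file from the start with the new value.
            counter.seek(0)
            counter.truncate()
            counter.write(str(serial))
            counter.flush()
        finally:
            fcntl.flock(counter, fcntl.LOCK_UN)
    return serial

# Throwaway location for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "id.counter")
print([next_serial(demo_path) for _ in range(3)])  # [1, 2, 3]
```

Every process that wants an ID has to go through this one file, which is the bottleneck the rest of this post tries to avoid.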

Providing a system-wide lock makes programming a tool much harder than it has to be. At any particular time, a process has a unique process id, which the computer uses to keep track of and schedule each thing that’s going on. Well, with Linux 2.6 (the current version), there is reportedly a maximum of 1,073,741,824 process id’s, which is 2^30. Encoding that many numbers in base32 takes six characters. And in the future, Linus and crew might up that number, although you would think a billion things going on with a computer at once would be enough for anybody!
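The six-character figure is easy to check. This is a small sketch with a made-up helper name that finds the smallest number of base32 characters able to label a given count of values.

```python
def base32_chars_needed(count: int) -> int:
    """Smallest number of base32 characters that can label `count` distinct values."""
    if count < 1:
        raise ValueError("count must be at least 1")
    chars = 1
    while 32 ** chars < count:
        chars += 1
    return chars

print(base32_chars_needed(1_073_741_824))  # 6, because 32**6 == 2**30
print(base32_chars_needed(32))             # 1
```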

So it’s probable that when implementing a unique ID with a process ID incorporated, we’ll need to cut this number down to size. That would mean some extra programming in order to have a registry of processes allowed to give out an ID. So we’re back to a locking mechanism, but arguably more effective, since you only need to register the process one time, and provide some way to clean up the mess after you’re done.

But let’s say we didn’t want to make a registry, and just confine our ID’s to the computer we’re using right now. After all, we don’t need to come up with an ID for the computer until we’re ready to share the information outside of it, right? So we can flag the ID with a code that says it is only valid on the current computer, then list the process ID after it. We could even have a multithreaded program give a short ID to each thread. Creating an in-memory mapping within a single program would be fairly easy: you’d just map the thread id when creating the thread, if mapping is really needed.
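Here is one way that might look, as a sketch only. The "L." flag, the dotted layout, and the class and function names are all invented for the example. Nothing here is a settled format.

```python
import os
import threading

class ThreadShortIds:
    """In-memory map from the interpreter's thread identifiers to small numbers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = {}

    def short_id(self) -> int:
        ident = threading.get_ident()
        with self._lock:
            # Thread identifiers can be reused after a thread exits, so a
            # long-running program would also want to drop finished threads.
            if ident not in self._ids:
                self._ids[ident] = len(self._ids)
            return self._ids[ident]

registry = ThreadShortIds()

def local_prefix() -> str:
    """Hypothetical prefix: 'L' flag (this computer only), process id, thread number."""
    return f"L.{os.getpid()}.{registry.short_id()}"

def demo(thread_count: int = 3) -> list:
    barrier = threading.Barrier(thread_count)  # keep all threads alive together
    seen = []

    def worker():
        barrier.wait()
        seen.append(local_prefix())

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)

print(demo())
```

Each thread ends up with its own prefix, and only the small in-memory dictionary needs a lock, not the whole system.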

Now, if we’re not using MAC addresses as the primary way to identify the computer, what’s the best way to go about it? More registries, please! We’ll want to make some kind of global registry so that an unknown value can be looked up, and if the author wants to share that information, then you’d be able to get its definition.

On the public side, we may want to consider a range of five characters, which our base32 converter tells us will provide over 33 million identifiers (32^5 = 33,554,432). So that’s how many public domains would be available, and it is still in the realm of a SQLite database lookup without making a small computer go dizzy. Five’s a good number, since a sequence of that many characters can be read off over the phone and copied down without the recipient losing track of the characters spoken.
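A small sketch of such a converter follows. The digit alphabet (0-9 followed by A-V) is an assumption, chosen because it is the one Python's int() understands for base 32. An alphabet meant for reading over the phone might instead drop look-alike characters such as O and I.

```python
# Assumed alphabet: digits first, then A through V.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def to_base32(value: int, width: int = 5) -> str:
    """Encode a non-negative integer as a fixed-width base32 string."""
    if not 0 <= value < 32 ** width:
        raise ValueError(f"{value} does not fit in {width} base32 characters")
    chars = []
    for _ in range(width):
        value, digit = divmod(value, 32)
        chars.append(DIGITS[digit])
    return "".join(reversed(chars))

print(32 ** 5)                      # 33554432 possible public identifiers
print(to_base32(0))                 # 00000
print(to_base32(32 ** 5 - 1))       # VVVVV
print(int(to_base32(1234567), 32))  # 1234567, decoded with the built-in int()
```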

Now we have to have some layers of redirection from the public identifier to the individual computer. For one, this gives us some sense of privacy. But even more important, this allows a large organization to give a level of autonomy to its divisions.

Two base32 characters provide up to 1024 values, while one character is 32. So with five more characters, we can allocate one character to the process id registration (32 processes per computer), 1024 computers per division, then 1024 divisions per public identifier. Now, I’m curious if that would be a good spread. What do you think?

Another way would be to have 1024 process id’s per computer, using two characters, then let the public ID have 32,768 computer id’s to dole out as they see fit. Any more than that, they’d just need to get another public ID. Now I feel better about using that spread.
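Both spreads can be tried out with a little bit-packing, since every base32 character carries five bits. This is only a sketch. The layout tuples and function names are invented here, and the field values are arbitrary.

```python
BITS_PER_CHAR = 5  # one base32 character holds five bits

# (field name, number of base32 characters), most significant field first.
SPREAD_A = (("division", 2), ("computer", 2), ("process", 1))
SPREAD_B = (("computer", 3), ("process", 2))

def capacities(layout) -> dict:
    """How many distinct values each field can hold."""
    return {name: 32 ** chars for name, chars in layout}

def pack(layout, **fields) -> int:
    """Combine the fields into one integer that fits the layout's characters."""
    packed = 0
    for name, chars in layout:
        value = fields[name]
        if not 0 <= value < 32 ** chars:
            raise ValueError(f"{name}={value} does not fit in {chars} characters")
        packed = (packed << (BITS_PER_CHAR * chars)) | value
    return packed

def unpack(layout, packed: int) -> dict:
    """Split a packed integer back into its fields."""
    fields = {}
    for name, chars in reversed(layout):
        bits = BITS_PER_CHAR * chars
        fields[name] = packed & ((1 << bits) - 1)
        packed >>= bits
    return fields

print(capacities(SPREAD_A))  # {'division': 1024, 'computer': 1024, 'process': 32}
print(capacities(SPREAD_B))  # {'computer': 32768, 'process': 1024}
print(unpack(SPREAD_B, pack(SPREAD_B, computer=300, process=7)))
```

Either way the packed value stays below 32^5, so it still fits in the five characters after the public identifier.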

If we wanted to go whole hog, then five characters would be enough to directly convert internal IP addresses to computer id’s, no matter if they used the 10.x.x.x, 172.16.x.x through 172.31.x.x, or 192.168.x.x scheme, or even a combination of those. But now that we’re at our five-character maximum, we have to allocate even more characters for our process id. And if the company started using sub-sub-nets, which may be common with virtualized OS’s, internal IP addresses can very possibly be duplicated, as I’ve seen with VirtualBox.
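The claim that five characters cover every internal address can be checked with Python's ipaddress module. The sketch below adds up the three private IPv4 blocks and numbers each address by its position in that combined space. The function name and the numbering order are my own assumptions.

```python
import ipaddress

# The three private IPv4 blocks, in the order used for numbering.
PRIVATE_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

total = sum(block.num_addresses for block in PRIVATE_BLOCKS)
print(total)             # 17891328
print(total <= 32 ** 5)  # True: it fits in five base32 characters

def computer_id(address: str) -> int:
    """Position of a private address within the three blocks laid end to end."""
    ip = ipaddress.ip_address(address)
    offset = 0
    for block in PRIVATE_BLOCKS:
        if ip in block:
            return offset + int(ip) - int(block.network_address)
        offset += block.num_addresses
    raise ValueError(f"{address} is not in a private IPv4 block")

print(computer_id("10.0.0.1"))      # 1
print(computer_id("192.168.1.10"))  # 17826058
```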

For giggles, the full range of IPv4 address space comes to 4000000 in base 32. Some expect exhaustion of address ranges as early as two years from now, but I just don’t think it’ll happen for a while. But if the world moves to IPv6, that’s over 3 with 38 zeroes after it, way too big for us to make a usable ID format out of anyway.
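Those two figures can be checked with nothing but Python's built-in integers. This assumes base32 digits written as 0-9 followed by A-V, which is what int() accepts.

```python
# 2**32 IPv4 addresses, written in base 32, is a 4 followed by six zeroes.
print(int("4000000", 32) == 2 ** 32)      # True

# The highest single address is one less, which still needs seven characters.
print(int("3VVVVVV", 32) == 2 ** 32 - 1)  # True

# IPv6 has 2**128 addresses, a 39-digit number.
print(len(str(2 ** 128)))                 # 39
print(f"{2 ** 128:.3e}")                  # 3.403e+38
```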

Well, I think that about covers all the scenarios in which we’d want to identify who generated the ID. Note that I am not making any effort to protect privacy. If one wants to keep the data anonymous, then they will have to arrange for an anonymous computer to do their dirty work. I would rather forsake simple anonymity than make our ID’s incompatible with any type of accountability.

Now we need a series of unique numbers for our ID generator to use. Which will bring me to my next post, about the strange concept we call time ….

1 Comment

  1. Mattyuy said,

    March 24, 2008 at 2:17 pm

    thats for sure, brother

