Wednesday, August 08, 2007

Friendly Readable ID Strings

One often needs unique ID strings to tag processes, threads, files or anything else that might be created in quantity. Traditionally these are created based on PIDs or similar system values. But it is not easy to visually recognise these strings, which makes debugging more difficult than it need be. This recipe creates readable and pronounceable ID strings that are much easier to work with.

I have submitted this to the Python Cookbook but thought I would write it up here as well. The code is freely available under the Python license.

The inspiration for this recipe is the fact that I hate debugging systems by trying to track IDs that look like "kSbT73oQ". It's much easier on the brain to scan strings like "levokosi". These are also pronounceable, so one can communicate with other members on the development team. "Hey dude, what's up with file sadofabi -- it appears to be locked."

I many cases one does not need billions of IDs, so restricting the strings to alternating consonants and vowels, and a reasonable length, gives us plenty of scope. The function GetFriendlyID() generates these of length 8, using the built-in Python pseudo-random generator. If this is not good enough, increase the length or use a more random module. But before you do so, realise that the algorithm here allows over 40 million unique strings.

(Of course it is sometimes useful to have ID strings based on PIDs or whatever, so they can be tied back to the originating process. In those cases this recipe is not applicable.)

The GetUniqueFriendlyID() function sketches out a typical implementation, where we want to ensure the generated IDs are, in fact, unique.

The module-level code does a simple test so we can inspect some resulting strings. Baby name generator, anyone?

from random import choice

def GetFriendlyID():
Create an ID string we can recognise.
(Think Italian or Japanese or Native American.)
v = 'aeiou'
c = 'bdfghklmnprstvw'

return ''.join([choice(v if i%2 else c) for i in range(8)])

def GetUniqueFriendlyID(used_ids):
Return an ID that is not in our list of used IDs.
# trying infinitely is a bad idea
LIMIT = 1000

count = 0
while count < LIMIT:
id = GetFriendlyID()
if id not in used_ids:
count += 1
id = ''
return id

if __name__ == '__main__':
from sets import Set

print 'some sample unique IDs:'
used_ids = Set()
for i in xrange(50):
id = GetUniqueFriendlyID(used_ids)
if not id:
print 'something broke'
print id

Edit 10 August: Fixed resetting ID string in GetUniqueFriendlyID(). Fixed import. Changed implementation example to use a set to store IDs.

Edit 8 August: I've simplified the algorithm into what is basically a one-liner. This version requires Python 2.5 as it uses conditional expressions.



Anonymous said...

Dude, that is a terrifically creative idea.

Chuck Adams said...

I've seen this long ago with password generators, and it's for the same reason: it's much easier to remember something that looks like a word. The one I remember from LambdaMOO also took care to avoid some of the obvious combinations that looked obscene, though it's impossible to get all of them. Imagine one of your users saying "process sukapenis is hung"? (huh huh..huh)

Anonymous said...

Instead of keeping a list of generated names I would append a unique number to the end of the generated word. Even a thread safe accumulator would guarentee uniqueness across the life time of the process.

Antti Rasinen said...

You can also precompute a list of character combinations:

alternating = [consonants, wovels]

And then use map:

"".join(map(choice, alternating))

Of course, with this method you can also build several variants of acceptable consonant-wovel combinations, such as vccvcvcc (abbarocks) or allow capital letters (at least for the first letter) and so on.

You can also add these in a list and then pick one randomly from those.

I would not use a list for storing the used IDs however. If the list is large and the search fails, that's a LIMIT * len(list) operations. A set would be more applicable. Or perhaps even a binary search tree =)

Ralph Corderoy said...

The unique version returns a non-unique id after LIMIT attempts; shouldn't it do something else? Returning a non-unique id may cause huge problems later.

Perhaps a better way would be to use the numbers from a PRNG and map those onto pronouncable words, discarding unpronouncable ones. That way, you know the period of the PRNG is large, that you're discarding only 66% of them, and you only have to remember the current seed to resume where you left off, not a list of all previously returned ones.

Cheers, Ralph.

David said...

Very nice even if one of my first generated values was diketude! Try explaining that one to the QA staff ;-)

Ants Aasma said...

You might want to check out Yould. Maybe a bit too overkill for given purposes, but I'm using it as a great way to automatically create strong easily memorizable passwords. Had to extend it with arbitrary length markov chains to support my own language - we routinely have three consonants in a row and the default chain length generated a lot of unpronouncable words with four or more consonants in a row.

Post a Comment