Friday, January 7, 2011

What's in your name...

One of my friends expressed that she can not remember the names that sound pretty similar to her. I do have the same problem - and, as I learned a couple of times, at least my surname causes a similar stress in some folks.

But, I remembered, it's been quite some time I was joking that replacing the names with SHA-1 hashes would be way more practical.

Consider:

1) SHA-1 hashes are all standard 40 bytes long hex numbers. This allows much more efficient storage of names in the databases and much more simplified input validation - it will be reduced to a trivial regex: /[0-9a-fA-F]{40}/ - and the input later on can be homogenized by lowercasing.

2) There's enough hashes to give everyone a unique ID. 2^140 which forty characters make, is huge. Of course, we need this huge number to prevent the hash collisions.
I should make the estimates of that a topic of a separate post.

3) The cryptographic hash is going to have pretty good distribution of digits. Corollary from this - means that within your circle of friends you do not need the full 40 characters. Much like close friends do not call you by surname - and just call you by name.

So, I've set out to look at (3) - because, obviously, the usability in this is paramount.

Short result: amongst the 370 friends I have on facebook, if they were to use SHA-1 hash of their name instead of their "real name", just five characters will be enough to uniquely identify all of them! This is even shorter than my own name!

So, it is very promising.

Here's how to do it in case you want to verify your own circle of friends (I take the facebook as an example).

Go to http://developers.facebook.com/docs/api and click on the link next to the string "Friends: ". This will give you an ugly page full of JSON. Save it into file and go into the shell.


cat friends.json | sed -e 's/^.*\[//g' | sed -e 's/\].*$//g' | sed -e 's/[{}]*//g' | sed -e 's/,/\n/g' | sed -e '/^"id":.*$/d' | sed -e 's/"name"://g' >names.txt


Now that we have a list of names, let's make hashes. Simplest is to create a short script to hash its argument:


#!/bin/sh
echo -n $1 | sha1sum


And finally we output the file running the hash on each name, then cutting out the first few characters, and seeing how many collisions do we get:


$ cat names.txt | xargs -n 1 sh hash | cut -c 1-3 | sort | uniq -c | grep -v ' 1 '
2 2c3
2 3d5
2 4fe
2 6ef
2 b32
2 b52
2 b7e
2 c78
2 c9f
2 d6f
2 e42
2 e6f
2 f69
2 ff5


So, with just three(!!!) characters of the hash there only a handful collisions.
Using couple more characters eliminates them and as well gives an ample room for a collision-free growth of my friend list.

And maybe now you get an idea why at the top of the page there's this row of hex, to begin with.

I think using the hashes instead of name is indeed much more practical than trying to remember the names - c3e9f, do you agree ? ;-)

7 comments:

Phillip Remaker said...

Your SED script is not working for me. It seems to want to intert literal "n"s and doesn't delete the ID. Part of the problem, I think is that the "^" regexp operator doesn't like the fact that there are leading spaces in from to "name" and "id". Still debugging....

Nighty said...

My version of that sed command (which also sorts):

cat friends.json \
        | sed -e 's/^.*\[//g' \
        | sed -e 's/\].*$//g' \
        | sed -e 's/[{}]*//g' \
        | sed -e 's/,/\n/g' \
        | sed -e '/^ .*"id":.*$/d' \
        | sed -e 's/"name"://g' \
        | sed -e 's/^ *//' \
        | sed -e '/^$/d' \
        | sort > names.txt

Extra rules were added to get rid of the empty lines and leading spaces also. Gives you a cleaner file to work on.

for a mapping of hash to name, use this as the contents for your hash script:

#!/bin/sh
echo "`echo -n $1 | sha1sum ` $1"

ps: love how this scales by just using more characters of the hash. Kind of serves the same purpose as last names do, but with less chance of collision.

Andrew Yourtchenko said...

@Nighty: cool, thanks! On a second thought, you do not even need to increase the length of all hashes, just those that make up collision. Much like in real life when you have name+surname collisions - "John Doe - no, not the one that is used all the examples".

For practical use - name, surname, place of birth and the date of birth should be used as a sha1 source... Since, if we have the same name+username adding the digits to hash would not help.

Phillip Remaker said...

So why don't you just use the name+facebookid in the JSON? See how that reduces your collisions.

PS: Nighty's sed script adds a literal "n" at the end of every line for me, not sure why.

cat friends.json | grep name | awk -F"\"" '{print $4}' >names.txt

is what worked for me.

Andrew Yourtchenko said...

Not everyone (though almost all) have the facebook. And I do not feel too comfortable with some number of a private corporation be my SSN. I already have one issued by the country's government. (speaking of which, it could be in the mix as well in the countries that have it - to help reduce the chances of the same contents being hashed).

Phillip Remaker said...

Ultimately,

cat friends.json | grep name | awk -F"\"" '{print "\"" $4 "\""}' > names.txt

Worked best since it preserved the double quotes around the names.

Also, names with shell-flavored characters (e.g., O'Rourke) were troublesome.

Andrew Yourtchenko said...

Nice. In the hindsight, probably a better approach would have been a FB app, I *think* one can do a pure-Javascript one, could be fun.