Will my hashing cache keys get a conflict?

The short answer to the title question is YES. However, the longer answer is: after LPS-16789 you don't really need to worry about it. The probability of seeing an actual conflict is negligible for most people in most cases.

If your system must be 100% safe in every case, stop reading this blog and go to LPS-16744; that is what you need.

However, if you are willing to take a little risk (it is very, very low; keep reading and you will see the actual numbers) by hashing your cache keys before storing them, the reward is a performance boost from reduced old generation GC pressure.

Let's clarify some concepts first:

  1. No hashing algorithm is strong enough to prevent all conflicts over an unlimited input source. The reason is that there is no one-to-one mapping from an unlimited space to a fixed-size space.
  2. No system is 100% safe. Are you sure your colo will never have a power failure? Or that your hardware will never have a "heart attack"? (Even the EARTH cannot live forever.)

Your system will eventually halt, either for an exceptional reason or for scheduled maintenance. So if the hashing conflict probability is lower than that of those non-preventable situations, I suggest you just let it go; the performance payback will be well worth the sacrifice.

Before I start doing the math to calculate the hashing conflict probability, I have to make an assumption.

Just as no hashing algorithm is strong enough to prevent all conflicts, no hashing algorithm can do the mapping in an absolutely even manner. Liferay's HashCodeCacheKeyGenerator uses the following formula (which is exactly the same algorithm that java.lang.String.hashCode() uses):

Hash(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Note: s is a String of length n.
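
To make the formula concrete, here is a minimal sketch of that rolling hash accumulated in a 64-bit long (so b = 2^64 in the analysis below). This is my own illustration of the formula, not the actual HashCodeCacheKeyGenerator source, and the sample key string is made up:

// A sketch of the rolling hash above, accumulated in a 64-bit long.
// Not a copy of HashCodeCacheKeyGenerator, just an illustration.
public class LongStringHash {

    public static long hash(CharSequence s) {
        long h = 0;

        // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], evaluated
        // incrementally (Horner's rule) and wrapping around in 64 bits.
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }

        return h;
    }

    public static void main(String[] args) {
        // A made-up cache key, just for illustration.
        System.out.println(hash("com.liferay.portal.model.User#10157"));
    }
}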

It is certainly not absolutely even; analyzing its uniformity is beyond my limited mathematics knowledge, so please pardon me for treating it as an even algorithm from now on, for the sake of simplifying the analysis.

With the above assumption, the problem is now much clearer. Let me describe the same problem in a different way:

Throwing n balls into b buckets (n <= b), what is the probability that we end up with no bucket containing more than 1 ball?

While I was trying to remind myself how to do the permutations and combinations, my wife surprisingly gave me the following answer within a minute (to be honest, I was a little annoyed by this; I hate to admit she can do math better than me, but I'm glad I married a smart woman):

P = (b/b) * ((b-1)/b) * ((b-2)/b) * ... * ((b-n+1)/b)
    = (b * (b-1) * (b-2) * ... * (b-n+1))/b^n
    = b!/(b^n*(b-n)!)

Then the conflict probability is:

Pc = 1 - P = 1 - b!/(b^n*(b-n)!)

Now let's put numbers into this formula. We use a long (64-bit) to hold the hash result, so we have 2^64 usable hash codes, which means b = 2^64.
The n is your cache size. You could try to configure your cache to support a super large size, but unless you are using an off-heap cache provider, I can promise you that you will most likely see an OOM before a hashing conflict appears.

A reasonable in-heap cache size in the real world would be about 10,000 ~ 100,000 entries.
If you are using a cache larger than this, you should probably consider breaking it down into smaller regions, and maybe even moving them out of the JVM heap; regardless of hash conflicts, a huge cache region is slow to look up and insert into.

Cache warmup hashing conflict probability

  Cache size           Conflict probability
  10,000 entries       2.710e-12
  100,000 entries      2.710e-10

Please be aware that the numbers above are the probabilities of a conflict when you insert 10,000 or 100,000 entries into an empty cache. I will call this the cache warmup hashing conflict rate.
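
As a quick sanity check, here is a small sketch (my own, under the uniformity assumption) that reproduces the table values using the first-order approximation Pc ~ n*(n-1)/(2*b), which is extremely accurate when n is tiny compared to b:

// Estimates the warmup conflict probability Pc = 1 - b!/(b^n * (b-n)!)
// via the first-order approximation Pc ~ n*(n-1)/(2*b).
public class WarmupConflictEstimate {

    public static void main(String[] args) {
        double b = Math.pow(2, 64); // number of distinct 64-bit hash values

        for (long n : new long[] {10000, 100000}) {
            double pc = (double)n * (n - 1) / (2 * b);

            // Prints roughly 2.710e-12 and 2.710e-10, matching the table.
            System.out.printf("%,d entries -> %.3e%n", n, pc);
        }
    }
}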

Once the cache is properly populated, its size stays roughly constant, but new entries keep coming in and old entries keep getting evicted; the cache has entered its swapping phase.
Now let's analyze the hashing conflict probability for a swapping cache.

This is actually quite easy: you have n balls inside b buckets (n <= b), and no bucket has more than 1 ball. Now you throw a new ball toward those buckets; what is the probability that you end up putting the new ball into a bucket that already contains a ball? Obviously it is n/b.

The difficult question here is how often you throw a new ball (insert a new entry into the cache). For any individual cache the actual insert rate is normally very low, because a cache is designed for reading, not writing. If you have a cache that is mainly written to rather than read from, you'd better reconsider your design. In my experience this number should be under 50 ips (inserts per second); this is an empirical number, I cannot prove it. If you have a way to measure the actual insert rate of your cache, please substitute your own number.

Note: you may see a high cache miss rate in the Ehcache JMX report for certain caches (it can be 100+ per second sometimes). In Ehcache, a miss means fetching from the DB and then caching the result, which amounts to an insert. But this is not exactly the insert rate I am talking about here. In my assumption the input source is unlimited, which means it never generates the same result twice; every single new entry is different from all previous ones. That kind of source does not exist in the real world; all input sources repeat themselves to some degree. For example, you may find that the user cache has a super high swapping rate at peak load, maybe 300+ ips (meaning that within one second your system has 300+ users logging in and 300+ users logging out). But that is OK: no matter how many times the same user logs in, he only counts once, because the actual cache key for that user is the same, and inserting the same key multiple times does not increase the conflict rate at all. Only different keys can cause a conflict.

Now let's see what the conflict probability looks like if you keep your system running for a whole year without maintenance (not even clearing the cache from Liferay's control panel).

There are 365 * 24 * 60 * 60 = 31,536,000 seconds in a year. Let's say your cache insert rate is 50 ips; then in one year you will perform 31,536,000 * 50 = 1,576,800,000 inserts.

Each insert takes a risk of n/2^64 of causing a conflict, where n is your cache size, so over a year the conflict probability is roughly 1,576,800,000 * n / 2^64, which gives us:

Cache swapping hashing conflict probability

  Cache size           Conflict probability
  10,000 entries       8.548e-7
  100,000 entries      8.548e-6

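And a similar sketch for the swapping numbers, under my assumed 50 ips insert rate: each insert risks about n/2^64, so a year of inserts risks roughly (inserts per year) * n / 2^64 in total:

// Estimates the yearly swapping conflict probability as
// (inserts per year) * n / 2^64, assuming 50 inserts per second.
public class SwappingConflictEstimate {

    public static void main(String[] args) {
        double b = Math.pow(2, 64);                     // distinct 64-bit hash values
        long insertsPerYear = 365L * 24 * 60 * 60 * 50; // 31,536,000 s * 50 ips

        for (long n : new long[] {10000, 100000}) {
            double pc = insertsPerYear * (n / b);

            // Prints roughly 8.548e-7 and 8.548e-6, matching the table.
            System.out.printf("%,d entries -> %.3e%n", n, pc);
        }
    }
}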

So, in summary: if within one year the probability of your server being out of service (whether due to a system failure or scheduled maintenance) is higher than 2.710e-10 + 8.548e-6 ≈ 8.548e-6 (this is almost guaranteed to be true), I suggest you hash your cache keys before storing them. The probability of seeing a conflict is very, very low, and even if it actually happens you can recover by clearing the cache in the control panel.

Let me emphasize this again: if you do need a 100% safe cache key, see LPS-16744; that is the solution.
