Monday 16 July 2012

Nose Dive into core Java: Understanding equals(Object o) and hashCode()

In this series, I plan to write on some of the fine details of the core Java API, some performance optimisations and some advanced Java concepts.

In this installation of the series(more to follow), we learn all there is to know about equals(Object o) and hashCode(), two fundamental methods that every serious developer must fully understand and keep in mind for the rest of their lives.

So what are these two methods?

These are methods defined in the Object class in java.lang package.
So every object you instantiate, inherits these methods because all objects extends from the java.lang.Object class.

The method definitions in the Object class are as shown below.

So why do we need to master them and what makes them special?
They are special and we need to master them because certain other aspects of the Java API rely on the correctness of these two methods for their efficiency and correctness when used, the biggest ones being the Collections API in the java.util package.

Even if you don't use the Collections API that much, you still need to know these two methods and their requirements thoroughly if you are implementing anything that has performance characteristics.

equals(Object o)

The purpose of the equals method is to provide a facility for us to be able ask, at any point, in the life of an object if an object we hold a reference to "equals" another object. It allows us to compare ANY two objects!

Literally, we want to ask

If we were to assign two numerical values to each of these two objects, should the two numbers be the same?

This question is very different from asking,

"Do the references for these two objects point to the same memory location?"

The second question is different because it asks if two objects are the same, i.e, occupy the same address space.

Just for emphasis, the equals method in the Object class and therefore the JDK asks the first one.

The default implementation of the equals method in the JDK is implemented as shown below.

Now whilst this implementation looks simplistic, almost like the guys at Sun(Ooops Oracle)are playing lazy, it has some serious consequences.

By using the equality operator on reference variables, the default implementation says if the two objects have identical bits, then they are equal! This is because, the equality operator on reference variables compares just the references being considered.

Understanding the equality operator

To fully understand what the code above does, lets remind ourselves quickly how the equality operator works with a quick example.

Consider the code below.

When we execute this code, this is the result we get;

Back to equals(Object o) method
So we can now understand that the default implementation, only asks if the two objects we are comparing have the same reference.

What does this mean?

Effectively, the JDK's default implementation answers our second question and uses that
answer as the answer to the first.

So the default implementation effectively says,

"If two objects have the same references, then they are also equal."

That in itself is not a lie since a reference can belong to only one object. No two different objects can have the same reference. So the default is true but a very narrow true.

The default implementation doesn't answer the first question comprehensively. It only takes the minimal truth that will always hold, but leaves the concept of equality of objects as a responsibility of the developer to decide. This is Object Oriented Programming(OOP) at its core.

You as the creator of the object must decide when two objects are equal. You can play along with the default implementation for trivial classes but in the end, it is your responsibility as the creator of the classes that represent the Objects in you domain space, to define the how equality is implemented among the objects you create.

The JDK developers have left a nice slot for you to answer what makes objects equal and that is why it is important to override the equals method in any non-trivial class that you implement.

What happens if I don't care about equality?

That's fine as long as you don't touch any of the Collections API classes, then you MAY be able to get away with your laziness. But if you touch any of them especially the ones that have Hash in their names, you run into serious trouble.

To fully appreciate it, lets think about why equality is important in the collections world.

We humans have used the concept of names to identify people since the stone age or possibly before then. But the flaw in the use of names is very easy to see. Just put 10 people in a room and give two of them the same name. The moment you call that name, we have a conflict. You must use another way to identify the one you want to call!

The creators of the Collections Framework sidestepped that conflict by partly relying on the equality of the objects in your collections. When Objects are put in a Hash* Collection, they go into baskets by the integer number returned by their hashCode() method. When you search for objects in that collection, the Collections API uses this value to identify the basket they went into.
Now in the selected basket, the collections API relies on the equals(Object obj) to find the object you are looking for. This means, if you have not overwritten the equals(Object o) method to define equality in a way you can recreate, and you use this Object as Key in a HashMap, then the only way you can fetch your items is to have a reference to the key.

Lets do an example to understand what we are talking about.

Suppose we want to use This class as a key for our data

And our data is actually represented by the following dummy class

Now lets see the how the argument made so far by implementing a lookup like this

When we execute to code above, we get the results shown below

In the second lookup, the result was not found because the equality operator used the references to try to find the Data.

To understand exactly why the second lookup failed, The following code shows the OpenJDK implementation of the get method of HashMap.

As you can see clearly on line;

The implementation is relying on the keys equals method key.equals(k). Since the default implementation uses equality reference operator(==), it means the second lookup will fail because it is in a different address location and hence the null result.

Also note that this implementation allows for null keys and will use getForNullKey() to find which ever object was stored under the null key.

The value returned can also be null as can be clearly seen in the last line.

I hope now you understand how the equals(Object o) method affects your use of the Collections Framework.

It doesn't stop with filling baskets

Well apart from using the equals and the hashCode() to store and retrieve Objects, The Collections Framework also relies on the equals in a number of places.

Typically all the contains methods in the List implementations rely on the equals(Object o) for retrieving the associated objects and these are clearly stated in the documentation for those methods. Also, the Set classes rely on equals(Object o) for making sure there are no duplicates.

All the following from the Map interface rely on a correct implementation of the equals(Object obj) method.

Also the following from the Collection


It doesn't stop with the Collections Framework...Contracts

The equals method also have contractual obligations it must adhere to. This is the Java API telling you to obey certain rules when overriding the equals(Object o) method.

The JDK's execution, performance etc are only guaranteed if your implementation obeys these rules. So if you override equals(Object o), your implementation must obey them.

The rules are - (ChRiST - No) - you can interpret it whichever way(for/agains/Swear) - just trying to help you and myself to remember

1. C-Consistent - For any two references x and y, irrespective of the number of times you invoke x.equals(y), it should consistently return true or consistently return false as long as none of the variables used in the definition of equals() have not changed.

2. R-Reflexive - For any reference x, x.equals(x) should return true. i.e., An object should equal itself! Logical, isn't it?

3. S - Symmetric - For any two references x and y, x.equals(y) should return true if and only if (iff) y.equals(x).
Put another way for programmers,

==> b must always equal true


4. T – Transitive – For any three references, x, y and z, if x.equals(y) return true, and y.equals(z) also returns true, the x.equals(z) must also return true.

5. N - Null Compare - for any non-null reference x, x.equals(null) MUST return false.
So its easy to that this implementation

followed by this call

will lead to compliance police call

Conclusion

.
1. Object equality is a responsibility of the developer to define. The default implementation only compares objects using the equality operator.

2. You MUST override equals(Object o) method of Object class if your class is going to be used as a key in any Collection, especially Hash-based collections(HashMap, HashSet, LinkedHashSet)

3. When you override equals(Object o), you MUSt obey the contractual obligations of the equals(Object o)method.

I will follow up with hashCode() in the next series.
Thanks for your time.

No comments: