A few times when I’ve conducted technical interviews with candidates for Software Engineer roles at the company where I was working, I’ve come across people who actually know how they might filter or sort a list, however when I delve into the depth of the problem, I start to find difficulties.

Many times, as software developers, we get used to using libraries and APIs that have most complex solutions already implemented in a single function, making our lives easier. This doesn’t mean that we put aside our curiosity and ability to investigate how things work, at least in its most basic form.

In this post, we are going to specifically analyze a very used function of the Kotlin API, this is distinctBy.

You can find the official documentation by clicking at this link.


Applying an algorithm in real life

Suppose you have a list of products in which, for reasons not relevant to this post, there are repeated elements. We clearly don’t want to show the user a list of duplicate items. So we know that we should filter the list.

We have a couple of alternatives:

Filtering the list with a loop

We could instantiate a new empty list and looping through the original list, adding element by element only if the current item of the iteration is not found in this new list.

Here we have a slight problem, that “only if it is not found in this new list” implies a search. Let’s assume that we haven’t used a hash table and are using a simple ArrayList.

We would have something like this:

fun filterList(products: List): List {
   val newList = emptyList()
   products.forEach { product ->
       if(findProduct(products, product) == false) {
           newList.add(product)
       }
   }
   return newList
}
  
fun findProduct(list: List, productToFind: Product): Boolean {
   list.forEach { product ->
       if(product.id == productToFind.id) {
           return true
       }
   }
   return false
}

Doing a quick analysis, the filterList function will call the findProduct function N times, where N is the number of elements in the original list.

The findProduct function will take up to N times to complete. Assuming that the element to search for is in the last position of the list, the worst case (worst case scenario) will be O(n).

This means that the complexity of our algorithm will be O(n²). That is, it could take units of time to finish. And if we talk about Memory Consumption (space complexity), we will have a new list, that is O(n).

Filtering the list with the distinctBy function

Great. It turns out that we know of the existence of a function called distinctBy in which we must pass it as a parameter, a function that indicates the “key” which will serve as an indicator of whether an element is repeated or not. This can be a first name, a last name, or a code.

We just need to write:

val newList = products.distinctBy { product -> product.code }

¡And, voilá!

Ready, one line of code was enough and we already have a new list that does not contain repeated products with the same code.

But… do we know what he does inside?

Let’s analyze the source code of this function.

As we can see, it is an inline extension function applicable to any data structure that implements the Iterable interface.

This function depends on two generic classes that it will need for its operation: one that determines the data types of the list to return (the same as the original list) denoted by the letter T, and another that determines the differentiator of the objects, denoted with the letter K.

In the first two lines, two new objects are instantiated. The first is a hash table or dictionary data structure that at the same time implements the Set interface (HashSet): it will not allow repeated elements thanks to the fact that it contains a hash table inside. The second is a simple list that will serve as the resulting filtered list.

Iterates over the original list (shown as this in the for statement) and executes the selector function that is passed as a parameter. This selector function is passed the current element of the iteration.

That means that in each iteration, the function { product -> product.code } will be returning what we want to serve as a unique identifier, that’s why its value is assigned to a variable named “key”.

Once we get the unique identifier (which in our example is the product code), we proceed to insert it into our hash table.

As we can see, this insertion occurs inside a conditional if.

For loop inside distinctBy function

The Set interface states that the add function will return true if, and only if, the element was inserted. And at what point is the element not inserted? Well, when it already exists (it is verified with a hash table!).

Once we know that this identifier did not exist in our identifier hash table (set.add(key) returns true), then we proceed to insert the element (the product) in our new list (list.add(e)).

No lookup is performed to see if the iteration item already exists in the list. We only try to insert its identifier (the product code) in the hash table and if this operation is successful (returns true) then we just add this product to the filtered list.

Since the whole original list is iterated anyway, the time complexity will be O(n), however, the verification of whether or not an element already exists in the list is returns in constant time O(1), due to our hash table inside our HashSet which will tell us if the identifier was inserted or not.

Of course, when creating two new objects, the memory space used will be a little larger, but not as relevant since the purpose of the HashSet is to store only the identifiers to know if an element has already been created. exists or not in the original list.


Summary

  1. The distinctby function uses a HashSet structure to speed up the cost of time with the verification of the existence of an element in our list.

  2. Unlike a HashMap, a HashSet needs only one element at insert time and it cannot be repeated, it acts as key and value at the same time (actually, if we look at the internal implementation, a generic static Object object is used as value).

  3. Of course, distinctBy offers us better performance than if we opted for an implementation of two iterations, one inside the other. However, we could also have used a HashMap and inserted on top of it using the product code as the key and the product as the value. The result would be a clean product structure.

What did you think? Do you also often analyze the functions that we use on a daily basis and that make our lives easier? I would like to know if you know of any other functions or APIs that also make use of these data structures efficiently and simplify our day-to-day developments 😄.