Deduplicating iterables while preserving order in Python
Whenever I need to deduplicate the items of an iterable in Python, my usual approach is to create a set from the iterable and then convert it back into a list or tuple. However, this approach doesn't preserve the original order of the items, which can be a problem if you need to keep the order unscathed. Here's a naive approach that works:
This code snippet defines a function dedup that takes an iterable it as input and returns a new list containing the unique items of the input iterable in their original order. The function uses a set seen to keep track of the items that have already been seen, and a list result to store the unique items.
Then it iterates over all the items of the input iterable using a for loop. For each item, the function checks if it has already been seen (i.e., if it's in the seen set). If the item hasn't been seen, it's added to both the seen set and the result list. The final result list contains the unique items of it in their original order.
This can be made a little terser by using listcomp as follows:
Dedup with ordered dict
Similar to the first snippet, this also defines dedup that takes an iterable it as input and returns a new list containing the unique items of it in their original order. The function uses the OrderedDict.fromkeys() method to create a new ordered dict with the items of it as keys and None as values.
Since an ordered dict maintains the insertion order of its keys, this effectively removes any duplicate items from the iterable without affecting the order of the remaining ones. The iterable containing the keys of the resulting ordered dict is then converted into a list using the list() function to obtain a list of the unique items in their original order.
While this is quite terse and does the job with O(1) complexity, neither this nor the previous solution would work for compound iterables as follows:
The next solution works on one-level nested iterables.
Dedup by any element of an item in a nested iterable
Consider this one-level nested iterable:
We want to write a dedup function that'll allow us to deduplicate the iterable based on a particular element of an item. Here, (1,1), (2, 1) are items of the iterable it and 1 is the second element of item (2, 1). Here's how we can modify the first dedup to allow deduplication by nested elements.
This time, the dedup function takes in an iterable of tuples it, an element index index, and a boolean lazy (defaulting to True) as arguments. The function returns a list or generator of the unique items from the input iterable based on the specified element index.
Just as before, the function first creates an empty set seen and binds its add method to a variable seen_add. It then creates a generator expression that iterates over it and yields each item if its element at the specified index isn't already present in the seen set. If item[index] isn't present in seen, it's added to it using the seen_add method.
If lazy is True, the function returns the generator expression verbatim. Otherwise, it returns a list created from the generator expression.
In the example provided, the function is called with arguments it, 1, and False. This means that it will deduplicate the input iterable based on the second element of each tuple and return a list. The result is [(1,1), (1,3)].
Further reading
Discussion in the ATmosphere