Duplication

NoSQL databases frequently are designed to shard or partition across many nodes. That, of course, makes enforcing unique IDs different than you might do in a SQL database. You can't efficiently have an autoincrement sequence, and instead have to have something like a UUID.

But if you've designed your NoSQL database badly, or your input data isn't well sanitized, you might find yourself in a situation where you can't guarantee uniqueness without validating every row. That's a bad place to be, but it's probably how the code Remco found started its life.

The purpose of this Java code is to query all the customer IDs from a database and ensure that they're fully unique.

    private Completable validateUniqueCustomerIds(List<Container> rootContainers) {
        if (!validateUniqueIds) {
            log.trace("validateUniqueCustomerIds validateUniqueIds == false -> skipping validation");
            return Completable.complete();
        }
        // use Flowable.share() to make sure we only call the repository once.
        final Flowable<String> nonEmptyCustomerIds = someMongoRepository.getCustomerIds()
                .filter(StringUtils::isNotEmpty).share();

        final Set<String> uniqueCustomerIds = nonEmptyCustomerIds.distinct().collectInto(new HashSet<String>(), Set::add).blockingGet();

        final Set<String> allCustomerIds = nonEmptyCustomerIds.collectInto(new HashSet<String>(), Set::add).blockingGet();
        final Set<String> duplicateCustomerIds = allCustomerIds.stream()
                .filter(id -> !uniqueCustomerIds.contains(id))
                .collect(toSet());

        if(uniqueCustomerIds.isEmpty()) {
            log.trace("validateUniqueCustomerIds uniqueCustomerIds.isEmpty(): true, returning Completable.complete()");
            return Completable.complete();
        }

        if(!duplicateCustomerIds.isEmpty()) {
            log.trace("validateUniqueCustomerIds duplicateCustomerIds.isEmpty(): false, returning Completable.error...");
            return Completable.error(new IllegalStateException("The following Customer IDs are non-unique: " + duplicateCustomerIds));
        }

        String version = rootContainers.get(0).getVersion();
        return checkExistingCustomerIdsInOtherPlaces(rootContainers, version, uniqueCustomerIds)
                .doOnComplete(() -> log.trace("validateUniqueCustomerIds checkExistingCustomerIdsInOtherPlaces complete"))
                .doOnError(throwable -> log.error("validateUniqueCustomerIds checkExistingCustomerIdsInOtherPlaces error: ", throwable))
                ;
    }

So, probably reasonably, we see that a variable (controlled somewhere else) enables or disables this method- skipping this validation seems like a thing you want. I'm not sure it's the best way, but we can let that slide.

Next, we fetch all of the non-empty customer IDs. Then we collect the uniqueCustomerIds by calling distinct and putting them in a Set.

Then we collect allCustomerIds by putting them in a Set without calling distinct…

Then we look at every entry in allCustomerIds and filter out the ones that are in uniqueCustomerIds, and absolutely none of that makes any sense.

First, the distinct call is unnecessary since we're collecting into Sets, which are by definition, only going to hold unique entries. uniqueCustomerIds and allCustomerIds are going to be the same sets. But even assuming that they're different, the filter makes no sense and isn't even the right operation. They attempted to do an intersection of two sets and failed.

The end result is a function that takes the wrong approach to solving a problem that itself was caused by taking the wrong approach.

Remco writes:

I've found this snippet in our codebase and it's been in there since 2020, about 2 years before I joined this team. I wonder how this got past code review?