Skip to content
May 28, 2011 / oop123

Removing Duplicates From an Array in Java – Sort and Remove vs LinkedHashSet

I described two ways of removing duplicates from an array: one is Sort and Remove and the other is using LinkedHashSet (or any Set for that matter, but LinkedHashSet retains the order of elements).

Which way is faster? A tough question without doing some testing. Unfortunately, I don’t have Mac or Linux, so this test is only limited to one platform and CPU (Windows 7 on AMD Athlon II Dual-Core M320 2.10GHz); I’m also just a Java beginner, so keep that in mind. All the test results are from the average of 10 runs of the test.

Objects

import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class Testing {
    public static final int NUM_ELEM = 10000; 
    public static final int RANGE = 1000;

    public static void main(String[] args) {
	    Integer[] array1 = new Integer[NUM_ELEM];
	    Integer[] array2 = new Integer[NUM_ELEM];
	    for (int i = 0; i < NUM_ELEM; i++) {
            array1[i] = new Integer((int) (Math.random() * RANGE));
            array2[i] = array1[i];
        }
                
        long start;
        System.gc(); System.gc(); System.gc();
        start = System.currentTimeMillis();
        removeDuplicatesWithSet(array1);
        System.out.println(System.currentTimeMillis() - start);

        System.gc(); System.gc(); System.gc();
        start = System.currentTimeMillis();
        removeDuplicatesWithSort(array2);
        System.out.println(System.currentTimeMillis() - start);
    }
        

    @SuppressWarnings({"unchecked"})
    public static <T> T[] removeDuplicatesWithSet(T[] array) {
        if (array == null || array.length == 0) return array;

        //create a linked hash set - remove all the duplicates of array while retaining order
        LinkedHashSet<T> set = new LinkedHashSet<T>();
        for (int i = 0; i < array.length; ++i) {
            if (array[i] != null) {
                set.add(array[i]); 
            }
        }

        //create an array of T (using reflection)
        T[] noDup = (T[]) Array.newInstance(array.getClass().getComponentType(), set.size());
        int count = 0;                
        for (T element : set) {
            noDup[count++] = element;
        }
        return noDup;
    }
        
    @SuppressWarnings({"unchecked"})
    public static <T extends Comparable<T>> T[] removeDuplicatesWithSort(T[] array) {
        if (array == null || array.length == 0) return array;

        //create a sorted copy
        T[] copy = Arrays.copyOf(array, array.length);
        Arrays.sort(copy);

        //remove the duplicates -> they will be together
        int count = 0;
        T previous = copy[0];
        
        for (int i = 1; i < array.length; ++i) {
            //deal with possible nulls - skip through them
            if (previous == null) {
                for (; copy[i] == null && i < array.length; i++);
            }

            else if (previous.compareTo(copy[i]) != 0) {
                previous = copy[i];
                copy[++count] = previous;
            }
        }

        return Arrays.copyOf(copy, count + 1);
    }
}
Number of Elements Range LinkekHashSet Sort and Remove
10000 1000 7.3ms 11.6ms
100000 1000 17ms 89ms
100000 10000 29ms 92ms
1000000 10000 About 300ms Above 1500ms
1000000 100000 About 650ms Above 1700ms

From the results, you can clearly see LinkedHashSet to remove duplicates from objects are much faster than primitives. Not much to say here.

Primitives

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class Testing {
    //test parameters
    public static final int NUM_ELEM = 10000;
    public static final int RANGE = 1000;
    
    public static void main(String[] args) {
        //create two arrays with same elements in same order
        int[] array1 = new int[NUM_ELEM];
        int[] array2 = new int[NUM_ELEM];
        for (int i = 0; i < NUM_ELEM; i++) {
            array1[i] = (int) (Math.random() * RANGE);
            array2[i] = array1[i];
        }
        
        long start;
        System.gc(); System.gc(); System.gc();
        start = System.currentTimeMillis();
        System.out.println(removeDuplicatesWithSet(array1));
        System.out.println(System.currentTimeMillis() - start);

        System.gc(); System.gc(); System.gc();
        start = System.currentTimeMillis();
        System.out.println(removeDuplicatesWithSort(array2));
        System.out.println(System.currentTimeMillis() - start);
    }

    public static int[] removeDuplicatesWithSet(int[] array) {
        if (array == null || array.length == 0) { return array; }

        //create a linked hash set that would remove all the duplicates of
        //the array while retaining order
        Set<Integer> set = new LinkedHashSet<Integer>();
        for (int i = 0, len = array.length; i < len; i++) {
            set.add(array[i]);
        }

        int[] noDups = new int[set.size()];
        int count = -1;
        for (Integer num : set) {
            noDups[++count] = num;
        }
        return noDups;
    }

    public static int[] removeDuplicatesWithSort(int[] array) {
        if (array == null || array.length == 0) { return array; }

        //create a sorted copy
        int[] copy = Arrays.copyOf(array, array.length);
        Arrays.sort(copy);

        //remove the duplicates -> they will be together
        int count = 0;
        int previous = copy[0];
        for (int i = 1; i < array.length; ++i) {
            if (previous != copy[i]) {
                previous = copy[i];
                copy[++count] = previous;
            }
        }

        return Arrays.copyOf(copy, count + 1);
    }
}
Number of Elements Range LinkekHashSet Sort and Remove
10000 1000 10.3ms 7.3ms
100000 10000 22ms 23ms
100000 100 16ms 17ms
100000 10000 32ms 26ms
100000 100000 103ms 34ms
1000000 100 90ms 106ms
1000000 10000 232ms 210ms
1000000 100000 760ms 350ms
1000000 1000000 Above 1500ms Around 300ms

Things has just gotten interesting. Sort and Remove is usually faster than using LinkedHashSet when working with primitives. Whether this due to the large amount of auto-boxing and auto-unboxing, object creation, and garbage collection is unknown. What is clear from the results are that the more elements there are in an array and the more number of different objects, the faster Sort and Remove run compare to LinkedHashSet. That is not to say Sort and Remove are good with primitives. You probably wouldn’t deal with 100000+ elements and/or incredibly widespread of different objects in normal Java applications in an array anyway. LinkedHashSet is still ultimately the pragmatic solution (yay I know big words).

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: