String merge sort in Java

Question

This snippet is about sorting a set of strings using merge sort. However, this version is tailored for strings as it relies on lcp values:

Given two strings \$A\$ and \$B\$, \$lcp(A, B)\$ is the length of the longest prefix shared by both \$A\$ and \$B\$. The cool point to note here is that if you know at least a lower bound of \$lcp(A, B)\$ (say, \$k < lcp(A, B)\$), whenever comparing \$A\$ and \$B\$, you can skip the first \$k\$ characters, and this string merge sort relies on these values in order to speed up the computation.

StringMergesort.java:

package net.coderodde.util;

import java.util.Arrays;
import java.util.Random;

public class StringMergesort {

    private static final int ALPHABET_SIZE = 26;

    private static final class ComparisonResult {
        int cmpValue;
        int lcp;
    }

    public static void sort(String[] array) {
        sort(array, 0, array.length);
    }

    public static void sort(String[] array, int fromIndex, int toIndex) {
        if (toIndex - fromIndex < 2) {
            return;
        }

        String[] auxArray = array.clone();
        int[] sourceLCPArray = new int[auxArray.length];
        int[] targetLCPArray = new int[auxArray.length];
        ComparisonResult result = new ComparisonResult();

        sortImpl(auxArray, 
                 array, 
                 sourceLCPArray,
                 targetLCPArray,
                 fromIndex, 
                 toIndex,
                 result);
    }

    private static void lcpCompare(String a, 
                                   String b, 
                                   int k,
                                   ComparisonResult comparisonResult) {
        int length = Math.min(a.length(), b.length());

        for (int i = k; i < length; ++i) {
            char ach = a.charAt(i);
            char bch = b.charAt(i);

            if (ach != bch) {
                comparisonResult.cmpValue = Character.compare(ach, bch);
                comparisonResult.lcp = i;
                return;
            }
        }

        comparisonResult.lcp = length;

        if (a.length() > length) {
            comparisonResult.cmpValue = 1;
        } else if (a.length() < length) {
            comparisonResult.cmpValue = -1;
        } else {
            comparisonResult.cmpValue = 0;
        }
    }

    private static void sortImpl(String[] source,
                                 String[] target,
                                 int[] sourceLCPArray,
                                 int[] targetLCPArray,
                                 int fromIndex,
                                 int toIndex,
                                 ComparisonResult result) {
        int rangeLength = toIndex - fromIndex;

        if (rangeLength < 2) {
            return;
        }

        int middle = fromIndex + ((toIndex - fromIndex) >> 1);

        sortImpl(target, 
                 source,
                 targetLCPArray,
                 sourceLCPArray,
                 fromIndex, 
                 middle,
                 result);

        sortImpl(target, 
                 source, 
                 targetLCPArray,
                 sourceLCPArray,
                 middle, 
                 toIndex,
                 result);

        merge(source,
              target,
              sourceLCPArray,
              targetLCPArray,
              fromIndex,
              middle - fromIndex,
              toIndex - middle,
              result);
    }

    private static void merge(String[] source, 
                              String[] target,
                              int[] sourceLCPArray,
                              int[] targetLCPArray,
                              int offset,
                              int leftRangeLength,
                              int rightRangeLength,
                              ComparisonResult result) {
        int left = offset;
        int leftBound = offset + leftRangeLength;
        int right = leftBound;
        int rightBound = right + rightRangeLength;
        int targetIndex = offset;

        while (left < leftBound && right < rightBound) {
            int leftLCP  = sourceLCPArray[left];
            int rightLCP = sourceLCPArray[right];

            if (leftLCP > rightLCP) {
                target[targetIndex] = source[left];
                targetLCPArray[targetIndex++] = sourceLCPArray[left++];
            } else if (rightLCP > leftLCP) {
                target[targetIndex] = source[right];
                targetLCPArray[targetIndex++] = sourceLCPArray[right++];
            } else {
                lcpCompare(source[left], source[right], leftLCP, result);

                if (result.cmpValue <= 0) {
                    target[targetIndex] = source[left];
                    targetLCPArray[targetIndex++] = sourceLCPArray[left++];
                    sourceLCPArray[right] = result.lcp;
                } else {
                    target[targetIndex] = source[right];
                    targetLCPArray[targetIndex++] = sourceLCPArray[right++];
                    sourceLCPArray[left] = result.lcp;
                }
            }
        }

        while (left < leftBound) {
            target[targetIndex] = source[left];
            targetLCPArray[targetIndex] = sourceLCPArray[left];
            targetIndex++;
            left++;
        }

        while (right < rightBound) {
            target[targetIndex] = source[right];
            targetLCPArray[targetIndex] = sourceLCPArray[right];
            targetIndex++;
            right++;
        }
    }

    public static void main(String[] args) {
        long seed = System.nanoTime();
        Random random = new Random(seed);
        String[] array1 = createRandomStringArray(500_000, 300, random);
        String[] array2 = array1.clone();

        System.out.println("Seed = " + seed);

        long startTime = System.nanoTime();
        Arrays.sort(array1);
        long endTime = System.nanoTime();

        System.out.format("Arrays.sort in %.2f milliseconds.\n",
                          (endTime - startTime) / 1e6);

        startTime = System.nanoTime();
        StringMergesort.sort(array2);
        endTime = System.nanoTime();

        System.out.format("StringMergesort.sort in %.2f milliseconds.\n",
                          (endTime - startTime) / 1e6);

        System.out.println("Arrays equal: " + Arrays.equals(array1, array2));
    }

    private static String[] createRandomStringArray(int size, 
                                                    int maxLength, 
                                                    Random random) {
        String[] ret = new String[size];

        for (int i = 0; i < size; ++i) {
            ret[i] = randomString(maxLength, random);
        }

        return ret;
    }

    private static String randomString(int maxLength, Random random) {
        int length = random.nextInt(maxLength + 1);
        StringBuilder sb = new StringBuilder(length);

        for (int i = 0; i < length; ++i) {
            sb.append((char)('a' + random.nextInt(ALPHABET_SIZE)));
        }

        return sb.toString();
    }    
}

My performance figures are:

Seed = 5896859395728
Arrays.sort in 1220.41 milliseconds.
StringMergesort.sort in 669.97 milliseconds.
Arrays equal: true

Is there anything to improve here? Once again, I would like to hear comments on naming conventions, coding style, performance, and optimisation opportunities.

Please remember to check existing tags before creating a new one. — Jamal, Commented Dec 28, 2015 at 18:58

greybeard · Accepted Answer · 2016-01-03 07:18:13Z

(Had to sort out what I wanted from this code, leaving:)
What is your goal coding this? From the amount of explanation, it does not seem to be educational. From the embedded micro-benchmark, it might be benchmarking an implementation of LCP mergesort.
I've tried to follow some of the thoughts presented below on Ideone, sort of minimally invasive. (The doc-comments are barely readable without tool support - I used Eclipse Mars. I've taken it quite a bit further - too early to put that in the open.)

Code should be commented - at least javadoc comments for public (and protected) members. (I'm still yearning for a quotable, easy to understand presentation of LCP merge-sort, esp. an intuition why the LCP handling in merge is sufficient, what string is the other one in common. Not quite easy enough: A. Eberle's thesis "3.1.1. LCP-Compare" and "3.1.2. Binary LCP-Merge [and] Mergesort" (pp. 19-22).)
Consider making sortImpl an instance method (you didn't hide the constructor, anyway), instead of passing around the arrays (and the ComparisonResult) in each and every call; lcpCompare() too, while you're at it. (non-trivial)
(Using a static sortImpl,) I'd drop auxArray: sortImpl(array.clone(), array, …, new ComparisonResult()).
Cloning the array as opposed to instantiation deserves a comment
interchanging the roles of source and target in the recursive calls in sortImpl deserves a comment.
int k in lcpCompare() might be named from, prefixLength, mightDiffer or postPrefix.
I'd much prefer sourceLCP/targetLCP over …LCPArray.
I'd rather have index increment in the "copy rest of run loops" look consistent with the merge loop.
I'm not sure my reasoning about the LCP handling in merge() is correct: please comment this part of your code. (and check the @param comments on Ideone)

(Roll-your-own micro benchmarks are even more dangerous than tool-supported ones - I don't intend to go into main() & co. any further.)

kph0x1 · Accepted Answer · 2016-01-08 19:07:22Z

1

Use final:

public static void sort(final String[] array, final int fromIndex, final int toIndex) {
    if (toIndex - fromIndex < 2) {
        return;
    }

    final String[] auxArray = array.clone();
    final int[] sourceLCPArray = new int[auxArray.length];
    final int[] targetLCPArray = new int[auxArray.length];
    final ComparisonResult result = new ComparisonResult();

Change void returns into this. Use StringMergesort as an object.

Modify returns of mutable objects, such as in createRandomStringArray:

    String[] ret = new String[size];

    for (int i = 0; i < size; ++i) {
        ret[i] = randomString(maxLength, random);
    }

    return ret;

Less hazardous:

    final String[] ret = new String[size];

    for (int i = 0; i < size; ++i) {
        ret[i] = randomString(maxLength, random);
    }

    return Arrays.copyOf(ret,ret.length);

answered Jan 8, 2016 at 19:07

kph0x1

3691 silver badge9 bronze badges

\$\begingroup\$ I had that habit of putting final everywhere, yet I was discouraged here at codereview to do so. \$\endgroup\$
– coderodde
Commented Jan 8, 2016 at 19:08
\$\begingroup\$ final is a favorite, helps distinguish when something is going to be mutable and be purposefully vulnerable to being reassigned, dwheeler.com/secure-programs/Secure-Programs-HOWTO.html section 10.6 Java. null and void are ones I try to aggressively remove. \$\endgroup\$
– kph0x1
Commented Jan 8, 2016 at 19:16
\$\begingroup\$ Mixed feelings about final here: writing new code, I use it everywhere I don't foresee value change (and almost never on method members). "Maintaining code", I don't regularly introduce it in declarations I don't change for a a more compelling reason (which might be utter conviction the previous formatting impairs readability). \$\endgroup\$
– greybeard
Commented Jan 8, 2016 at 19:36
\$\begingroup\$ @kph0x1 immutability rocks, but in Java (where it's not the default) it's cutting against the grain of the language. When over 90% of all variables, parameters etc. would have to be marked as final, it's time to ask whether the noise we're creating is really worth it, or are we perhaps programming in the wrong language? Personally I use final when I mean to specifically emphasize the intention of immutability, if I used it on every value that happens not to get modified, that emphasis would be lost. I say keep your methods small and leave it for the reader to see the intention \$\endgroup\$
– Konrad Morawski
Commented Apr 7, 2016 at 21:49
\$\begingroup\$ Note that I'm not saying the technique itself is wrong, quite the contrary (and I agree with avoiding nulls, too). I appreciate functional programming flavor more than the next guy. I just mean that we're not under some obligation to always explicitly enforce immutability on language level if it comes at the cost of readability, and unfortunately in this case it's likely. This is entirely a matter of taste of course, programmers like their 0s and 1s, so they feel uncomfortable about compromise positions : ) \$\endgroup\$
– Konrad Morawski
Commented Apr 7, 2016 at 21:56

Add a comment |

Stack Exchange Network

String merge sort in Java

2 Answers 2

Your Answer

Linked

Hot Network Questions

String merge sort in Java

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions