Sort on a string that may contain a number

92,702

Solution 1

The Alphanum Algorithm

From the website

"People sort strings with numbers differently than software. Most sorting algorithms compare ASCII values, which produces an ordering that is inconsistent with human logic. Here's how to fix it."

Edit: Here's a link to the Java Comparator Implementation from that site.

Solution 2

Interesting little challenge, I enjoyed solving it.

Here is my take at the problem:

String[] strs =
{
  "eee 5 ddd jpeg2001 eee",
  "eee 123 ddd jpeg2000 eee",
  "ddd",
  "aaa 5 yy 6",
  "ccc 555",
  "bbb 3 ccc",
  "bbb 9 a",
  "",
  "eee 4 ddd jpeg2001 eee",
  "ccc 11",
  "bbb 12 ccc",
  "aaa 5 yy 22",
  "aaa",
  "eee 3 ddd jpeg2000 eee",
  "ccc 5",
};

Pattern splitter = Pattern.compile("(\\d+|\\D+)");

public class InternalNumberComparator implements Comparator
{
  public int compare(Object o1, Object o2)
  {
    // I deliberately use the Java 1.4 syntax, 
    // all this can be improved with 1.5's generics
    String s1 = (String)o1, s2 = (String)o2;
    // We split each string as runs of number/non-number strings
    ArrayList sa1 = split(s1);
    ArrayList sa2 = split(s2);
    // Nothing or different structure
    if (sa1.size() == 0 || sa1.size() != sa2.size())
    {
      // Just compare the original strings
      return s1.compareTo(s2);
    }
    int i = 0;
    String si1 = "";
    String si2 = "";
    // Compare beginning of string
    for (; i < sa1.size(); i++)
    {
      si1 = (String)sa1.get(i);
      si2 = (String)sa2.get(i);
      if (!si1.equals(si2))
        break;  // Until we find a difference
    }
    // No difference found?
    if (i == sa1.size())
      return 0; // Same strings!

    // Try to convert the different run of characters to number
    int val1, val2;
    try
    {
      val1 = Integer.parseInt(si1);
      val2 = Integer.parseInt(si2);
    }
    catch (NumberFormatException e)
    {
      return s1.compareTo(s2);  // Strings differ on a non-number
    }

    // Compare remainder of string
    for (i++; i < sa1.size(); i++)
    {
      si1 = (String)sa1.get(i);
      si2 = (String)sa2.get(i);
      if (!si1.equals(si2))
      {
        return s1.compareTo(s2);  // Strings differ
      }
    }

    // Here, the strings differ only on a number
    return val1 < val2 ? -1 : 1;
  }

  ArrayList split(String s)
  {
    ArrayList r = new ArrayList();
    Matcher matcher = splitter.matcher(s);
    while (matcher.find())
    {
      String m = matcher.group(1);
      r.add(m);
    }
    return r;
  }
}

Arrays.sort(strs, new InternalNumberComparator());

This algorithm need much more testing, but it seems to behave rather nicely.

[EDIT] I added some more comments to be clearer. I see there are much more answers than when I started to code this... But I hope I provided a good starting base and/or some ideas.

Solution 3

Ian Griffiths of Microsoft has a C# implementation he calls Natural Sorting. Porting to Java should be fairly easy, easier than from C anyway!

UPDATE: There seems to be a Java example on eekboom that does this, see the "compareNatural" and use that as your comparer to sorts.

Solution 4

The implementation I propose here is simple and efficient. It does not allocate any extra memory, directly or indirectly by using regular expressions or methods such as substring(), split(), toCharArray(), etc.

This implementation first goes across both strings to search for the first characters that are different, at maximal speed, without doing any special processing during this. Specific number comparison is triggered only when these characters are both digits. A side-effect of this implementation is that a digit is considered as greater than other letters, contrarily to default lexicographic order.

public static final int compareNatural (String s1, String s2)
{
   // Skip all identical characters
   int len1 = s1.length();
   int len2 = s2.length();
   int i;
   char c1, c2;
   for (i = 0, c1 = 0, c2 = 0; (i < len1) && (i < len2) && (c1 = s1.charAt(i)) == (c2 = s2.charAt(i)); i++);

   // Check end of string
   if (c1 == c2)
      return(len1 - len2);

   // Check digit in first string
   if (Character.isDigit(c1))
   {
      // Check digit only in first string 
      if (!Character.isDigit(c2))
         return(1);

      // Scan all integer digits
      int x1, x2;
      for (x1 = i + 1; (x1 < len1) && Character.isDigit(s1.charAt(x1)); x1++);
      for (x2 = i + 1; (x2 < len2) && Character.isDigit(s2.charAt(x2)); x2++);

      // Longer integer wins, first digit otherwise
      return(x2 == x1 ? c1 - c2 : x1 - x2);
   }

   // Check digit only in second string
   if (Character.isDigit(c2))
      return(-1);

   // No digits
   return(c1 - c2);
}

Solution 5

I came up with a quite simple implementation in Java using regular expressions:

public static Comparator<String> naturalOrdering() {
    final Pattern compile = Pattern.compile("(\\d+)|(\\D+)");
    return (s1, s2) -> {
        final Matcher matcher1 = compile.matcher(s1);
        final Matcher matcher2 = compile.matcher(s2);
        while (true) {
            final boolean found1 = matcher1.find();
            final boolean found2 = matcher2.find();
            if (!found1 || !found2) {
                return Boolean.compare(found1, found2);
            } else if (!matcher1.group().equals(matcher2.group())) {
                if (matcher1.group(1) == null || matcher2.group(1) == null) {
                    return matcher1.group().compareTo(matcher2.group());
                } else {
                    return Integer.valueOf(matcher1.group(1)).compareTo(Integer.valueOf(matcher2.group(1)));
                }
            }
        }
    };
}

Here is how it works:

final List<String> strings = Arrays.asList("x15", "xa", "y16", "x2a", "y11", "z", "z5", "x2b", "z");
strings.sort(naturalOrdering());
System.out.println(strings);

[x2a, x2b, x15, xa, y11, y16, z, z, z5]

Share:
92,702

Related videos on Youtube

Paul Tomblin
Author by

Paul Tomblin

Java/C++/Perl/Python/Javascript/C# programmer on Linux/Unix/macOS/Windows. 25+ years experience. Still learning. http://www.linkedin.com/in/paultomblin

Updated on March 25, 2021

Comments

  • Paul Tomblin
    Paul Tomblin about 3 years

    I need to write a Java Comparator class that compares Strings, however with one twist. If the two strings it is comparing are the same at the beginning and end of the string are the same, and the middle part that differs is an integer, then compare based on the numeric values of those integers. For example, I want the following strings to end up in order they're shown:

    • aaa
    • bbb 3 ccc
    • bbb 12 ccc
    • ccc 11
    • ddd
    • eee 3 ddd jpeg2000 eee
    • eee 12 ddd jpeg2000 eee

    As you can see, there might be other integers in the string, so I can't just use regular expressions to break out any integer. I'm thinking of just walking the strings from the beginning until I find a bit that doesn't match, then walking in from the end until I find a bit that doesn't match, and then comparing the bit in the middle to the regular expression "[0-9]+", and if it compares, then doing a numeric comparison, otherwise doing a lexical comparison.

    Is there a better way?

    Update I don't think I can guarantee that the other numbers in the string, the ones that may match, don't have spaces around them, or that the ones that differ do have spaces.

  • Nick Johnson
    Nick Johnson over 15 years
    This doesn't entirely solve the problem - you'd need to tokenise the string to be sorted and sort using this algorithm on each piece individually.
  • PhiLho
    PhiLho over 15 years
    Note: Paul accepted your answer but my algorithm sticks more closely to his problem (the way it explained it!), for cases like "Allegia 51B Clasteron". Not a problem, he choose whatever fit his needs, and this Alphanum implementation is fine (and multilanguage!), I just wanted to point that out. :-P
  • HRgiger
    HRgiger almost 8 years
    nice one! An additional null and instanceof String check would be nice too
  • PhiLho
    PhiLho almost 8 years
    @HRgiger You have a point about null check, I assumed the array was "sane". But today, I would just ditch the pre-Java 1.5 syntax and use generics, not instanceof.
  • Klitos Kyriacou
    Klitos Kyriacou over 6 years
    This implementation deals with the OP's specific example inputs, but for general use be aware that it fails to cope with numbers that have leading zeros. It thinks that "01234" is greater than "5678".
  • Lukas Eder
    Lukas Eder about 6 years
    You may want to cache your Pattern.compile() calls, given that they're called with O(N log N) complexity!
  • JustinKSU
    JustinKSU about 6 years
    Good suggestion. Code is updated. Scanner is also now closed using "try with resources".
  • Holger
    Holger over 5 years
    Instead of dealing with Scanner, you could simply call NUMBER_PATTERN.matcher(s), followed by repeatedly calling find on the returned Matcher. The great thing is that the matcher will tell you the start and end position for every match, making the entire split operation trivial. And it’s not a resource demanding a try(…) {…} block.
  • JustinKSU
    JustinKSU over 5 years
    @Holger Interesting idea. I would implement it and put as a separate answer. I'll throw you an upvote.
  • Holger
    Holger over 5 years
    I don’t know whether it’s unique enough to deserve another answer. After all, it would still do the same. By the way, the initial statement if(str1 == null || str2 == null) { return 0; } is broken, as it implies that if either argument is null, it will be reported to be equal to the other argument. But when null is equal to any other input, all inputs must be equal (the transitivity rule). The easiest solution would be not to support null at all. Otherwise, you would have to use something like if(str1 == str2) return 0; if(str1 == null) return 1; if(str2 == null) return -1;.
  • JustinKSU
    JustinKSU over 5 years
    @Holger I disagree on how to handle nulls stackoverflow.com/a/2401629/724835
  • Holger
    Holger over 5 years
    That linked answer describes a consistent behavior. It even matches the second of my suggestions, if(str1 == str2) return 0; if(str1 == null) return 1; if(str2 == null) return -1;. The code of your answer returns zero in all three cases which is inconsistent and violates the contract of Comparator. It does not match the behavior described in the linked answer.
  • JustinKSU
    JustinKSU over 5 years
    @Holger I stand correct. I misread your first suggestion. Please update my answer to handle nulls better.
  • Michael Böckling
    Michael Böckling almost 5 years
    I like it because it is readable. I propose changing the for loops to while loops instead, like this: while ((x1 < len1) && Character.isDigit(s1.charAt(x1))) { x1++;}
  • Olivier OUDOT
    Olivier OUDOT over 4 years
    @Michael, can you please explain why you think it is better ? For me it is exactly the same.....
  • Olivier OUDOT
    Olivier OUDOT over 4 years
    I have made notable performance improvements by adding a local static final method isDigit() instead of using Character.isDigit(). I suppose this favors inline code expansion at compile time.
  • Mike
    Mike about 3 years
    gives wrong result for "1000X Radonius Maximus" and "10X Radonius"
  • Mike
    Mike about 3 years
    reproduced java.lang.IllegalArgumentException: Comparison method violates its general contract!
  • Mike
    Mike about 3 years
    Great code! I would only do it case insensitive with char ch1 = Character.toUpperCase(s1.charAt(i1)); so that 1000a be less than 1000X
  • Daniel Alder
    Daniel Alder almost 3 years
    I made some changes for sorting leading zeroes: pastebin.com/tbEYj2zf
  • Yaxita Shah
    Yaxita Shah almost 3 years
    Excellent !! Thanks a lot