Regular expression matching fully qualified class names

Solution 1

A fully qualified Java class name is a sequence of dot-separated parts (let's call each part "N"), so it has the structure

N.N.N.N

Each "N" part must be a Java identifier. Java identifiers cannot start with a digit, but after the initial character they may use any combination of letters, digits, underscores and dollar signs:

([a-zA-Z_$][a-zA-Z\d_$]*\.)*[a-zA-Z_$][a-zA-Z\d_$]*
------------------------    -----------------------
          N                           N

Nor can they be a reserved word (like import, true or null). If you only want to check plausibility, the above is enough. If you also want to check validity, you must check each part against a list of reserved words as well.
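
A minimal sketch of that two-step check, assuming Java and java.util.regex. The class name PlausibleClassName and the methods isPlausible and isValid are made up for illustration; the reserved-word list is typed in by hand from the JLS keyword and literal lists, so verify it against the language version you target.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: plausibility via the ASCII regex, validity via a reserved-word list.
final class PlausibleClassName {

    // The ASCII-only pattern from above: identifiers separated by single dots.
    private static final Pattern FQCN = Pattern.compile(
            "([a-zA-Z_$][a-zA-Z\\d_$]*\\.)*[a-zA-Z_$][a-zA-Z\\d_$]*");

    // Reserved keywords, the literals true/false/null, and "_" (a keyword since Java 9).
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "abstract", "assert", "boolean", "break", "byte", "case", "catch",
            "char", "class", "const", "continue", "default", "do", "double",
            "else", "enum", "extends", "final", "finally", "float", "for",
            "goto", "if", "implements", "import", "instanceof", "int",
            "interface", "long", "native", "new", "package", "private",
            "protected", "public", "return", "short", "static", "strictfp",
            "super", "switch", "synchronized", "this", "throw", "throws",
            "transient", "try", "void", "volatile", "while",
            "true", "false", "null", "_"));

    // Plausibility only: does the text have the shape N.N.N?
    static boolean isPlausible(String name) {
        return FQCN.matcher(name).matches();
    }

    // Validity: the shape is right and no part is a reserved word.
    static boolean isValid(String name) {
        if (!isPlausible(name)) {
            return false;
        }
        for (String part : name.split("\\.")) {
            if (RESERVED.contains(part)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPlausible("a.import.C"));       // true: the shape is fine
        System.out.println(isValid("a.import.C"));           // false: "import" is reserved
        System.out.println(isValid("java.util.ArrayList"));  // true
    }
}
```

Splitting on the dot is safe here only because the regex has already rejected empty parts such as the one in "b..C".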

Java identifiers may contain any Unicode letter, not just Latin ones. If you want to allow for this as well, use Unicode character classes:

([\p{Letter}_$][\p{Letter}\p{Number}_$]*\.)*[\p{Letter}_$][\p{Letter}\p{Number}_$]*

or, for short

([\p{L}_$][\p{L}\p{N}_$]*\.)*[\p{L}_$][\p{L}\p{N}_$]*
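
As a rough illustration of the short form in Java (the class name UnicodeClassName and the sample names are invented for this sketch), note that each backslash has to be doubled inside a Java string literal:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the \p{L} / \p{N} variant of the pattern.
final class UnicodeClassName {

    // One identifier: a letter, underscore or dollar sign, then letters, numbers, _ or $.
    private static final String ID = "[\\p{L}_$][\\p{L}\\p{N}_$]*";

    // Zero or more "identifier." prefixes followed by a final identifier.
    private static final Pattern FQCN = Pattern.compile("(" + ID + "\\.)*" + ID);

    static boolean matches(String name) {
        return FQCN.matcher(name).matches();
    }

    public static void main(String[] args) {
        // \u00DC is a capital U with umlaut; written as an escape to keep the source ASCII-only.
        System.out.println(matches("com.example.\u00DCbersicht"));
        // Three katakana letters as a class name.
        System.out.println(matches("com.example.\u30AF\u30E9\u30B9"));
        // A leading digit is still rejected.
        System.out.println(matches("com.example.9Lives"));
    }
}
```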

The Java Language Specification (section 3.8) has all the details about valid identifier names.

Also see the answer to this question: Java Unicode variable names

Solution 2

Here is a fully working class with tests, based on the excellent comment from @alan-moore:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.regex.Pattern;

import org.junit.Test;

public class ValidateJavaIdentifier {

    private static final String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
    private static final Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");

    public static boolean validateJavaIdentifier(String identifier) {
        return FQCN.matcher(identifier).matches();
    }


    @Test
    public void testJavaIdentifier() throws Exception {
        assertTrue(validateJavaIdentifier("C"));
        assertTrue(validateJavaIdentifier("Cc"));
        assertTrue(validateJavaIdentifier("b.C"));
        assertTrue(validateJavaIdentifier("b.Cc"));
        assertTrue(validateJavaIdentifier("aAa.b.Cc"));
        assertTrue(validateJavaIdentifier("a.b.Cc"));

        // after the initial character identifiers may use any combination of
        // letters and digits, underscores or dollar signs
        assertTrue(validateJavaIdentifier("a.b.C_c"));
        assertTrue(validateJavaIdentifier("a.b.C$c"));
        assertTrue(validateJavaIdentifier("a.b.C9"));

        assertFalse("cannot start with a dot", validateJavaIdentifier(".C"));
        assertFalse("cannot have two dots following each other",
                validateJavaIdentifier("b..C"));
        assertFalse("cannot start with a number ",
                validateJavaIdentifier("b.9C"));
    }
}

Solution 3

The pattern provided by Renaud works, but his original answer will always backtrack at the end.

To optimize it, you can essentially swap the first half with the last: match one identifier first, then any number of ".identifier" groups. Note that the dot moves from the end of the repeated group to its start.

The following is my version; compared to the original, it runs about twice as fast:

String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";
Pattern FQCN = Pattern.compile(ID_PATTERN + "(\\." + ID_PATTERN + ")*");
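
To make the reordering concrete, here is a small sketch that puts both shapes side by side and checks that they accept the same inputs. The "trailing" form is only my assumption about what the original looked like (repeated "identifier." prefixes, then a final identifier); the class name FqcnShapes and the sample strings are made up. In that form the repeated group also tries the last identifier, finds no dot after it, and has to give it back, which is the backtracking mentioned above. The sketch does not measure speed.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of the two ways to arrange the same pattern.
final class FqcnShapes {

    private static final String ID =
            "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*";

    // Assumed original shape: (identifier.)* identifier
    static final Pattern TRAILING = Pattern.compile("(" + ID + "\\.)*" + ID);

    // Reordered shape: identifier (.identifier)*
    static final Pattern LEADING = Pattern.compile(ID + "(\\." + ID + ")*");

    public static void main(String[] args) {
        String[] samples = {"C", "a.b.Cc", "a.b.C$c", ".C", "b..C", "b.9C", "a.b."};
        for (String s : samples) {
            boolean trailing = TRAILING.matcher(s).matches();
            boolean leading = LEADING.matcher(s).matches();
            // Both columns should agree for every sample.
            System.out.println(s + " -> " + trailing + " / " + leading);
        }
    }
}
```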

I cannot write comments, so I decided to write an answer instead.

Solution 4

I arrived independently at an answer similar to Tomalak's, of the form M.M.M.N:

([a-z][a-z_0-9]*\.)*[A-Z_]($[A-Z_]|[\w_])*

Where,

M = ([a-z][a-z_0-9]*\.)*
N = [A-Z_]($[A-Z_]|[\w_])*

However, this regular expression (unlike Tomalak's answer) makes more assumptions:

  1. The package name (the M part) is all lower case: the first character of M is always a lower-case letter, and the rest can mix underscores, lower-case letters and digits.

  2. The class name (the N part) always starts with an upper-case letter or an underscore, and the rest can mix underscores, letters and digits. Inner class names are always introduced by a dollar sign ($) and must obey the class name rules just described.

Note: the pattern is written in XSD regular expression syntax, where \w covers letters and digits but does not include the underscore (_), which is why the underscore is listed separately in [\w_].
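
If you want to try the same idea with java.util.regex rather than XSD, two details change: an unescaped $ is an end-of-input anchor in Java and has to be written as \$, and Java's \w already includes the underscore. A minimal sketch under those assumptions (the class name ConventionalClassName and the sample inputs are invented here):

```java
import java.util.regex.Pattern;

// Hypothetical Java rendering of the M.M.M.N pattern above.
final class ConventionalClassName {

    // M: lower-case package segments, each followed by a dot.
    private static final String PACKAGES = "([a-z][a-z_0-9]*\\.)*";

    // N: starts with an upper-case letter or underscore; "$" may only appear
    // directly in front of another upper-case letter or underscore (inner class).
    private static final String CLASS_NAME = "[A-Z_](\\$[A-Z_]|\\w)*";

    private static final Pattern FQCN = Pattern.compile(PACKAGES + CLASS_NAME);

    static boolean matches(String name) {
        return FQCN.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("com.example.Outer$Inner")); // true
        System.out.println(matches("com.example.Outer$inner")); // false: inner name is lower case
        System.out.println(matches("Com.example.Foo"));         // false: package starts upper case
        System.out.println(matches("com.example.foo"));         // false: class starts lower case
    }
}
```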

Hope this helps.

Author by

Chun ping Wang

Updated on October 03, 2020

Comments

  • Chun ping Wang
    Chun ping Wang over 3 years

    What is the best way to match fully qualified Java class name in a text?

    Examples: java.lang.Reflect, java.util.ArrayList, org.hibernate.Hibernate.

    • Johan Sjöberg
      Johan Sjöberg over 13 years
      What context do these appear in, java import statements? If there's only the ; to remove then don't use regex
    • Hollis Waite
      Hollis Waite about 8 years
      Forget regular expressions; see javax.lang.model.SourceVersion.isName(CharSequence).
  • Johan Sjöberg
    Johan Sjöberg over 13 years
    You don't need the [], this should be enough (\\w+\\.?)+
  • krtek
    krtek over 13 years
    I think the [] makes things clearer, regexp are already messy enough ;) and I let the last bit outside to clearly separate packages from class name.
  • Richard Miskin
    Richard Miskin over 13 years
    Java identifiers can start with any currency symbol so $val, £val and ¥val are all valid. I think this applies to classes as well as variables. See the java api download.oracle.com/javase/1.5.0/docs/api/java/lang/…
  • Tomalak
    Tomalak over 13 years
    @Richard: Okay, thanks for the info. Then \p{Currency_Symbol} or \p{Sc} should be used instead of $. Thinking about it, a small parser that calls isJavaIdentifierPart() and isJavaIdentifierStart() repeatedly would result in cleaner code.
  • Richard Miskin
    Richard Miskin over 13 years
    I agree a parser is the way to do it, it's almost as if the Java language designers wrote the Character API with this in mind ;) However the question is about a regex so I think you've got the correct answer. +1 from me.
  • Alan Moore
    Alan Moore over 13 years
    Actually, those methods are already represented by special character classes. All we need to match a Java identifier is "(\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*\\.)+\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*". Elegance, thy name is Java!
  • Tomalak
    Tomalak over 13 years
    @Alan: Very nice! Thank you. :-)
  • Chun ping Wang
    Chun ping Wang over 13 years
    I want to check whether the given input is a valid Java class name (fully qualified, including the package), using Hibernate Validator (annotation style via @Pattern).
  • aliteralmind
    aliteralmind about 10 years
    @AlanMoore: Very nice. Unfortunately, RegexBuddy 3.6.2 doesn't recognize these classes under its Java flavor. It also doesn't recognize \p{Currency_Symbol}, but it does recognize \p{Sc}. Haven't tested much further, but I'm going to have to, because RegexBuddy is pretty important to my workflow.
  • aliteralmind
    aliteralmind about 10 years
    RegexBuddy can handle this: ([\p{L}_\p{Sc}][\p{L}\p{N}_\p{Sc}]*\.)+
  • Alan Moore
    Alan Moore about 10 years
    Yeah, RegexBuddy 4 doesn't recognize the "javaJava" categories either, which is just as well. But RB4 has made huge improvements, both in the number of flavors covered and in the thoroughness of the coverage. Nice new features, too; the flavor conversion feature alone is worth the price of the upgrade.
  • android developer
    android developer about 9 years
    I think the first part shouldn't include the "$" character.
  • Tomalak
    Tomalak about 9 years
    @androiddeveloper $ is a valid identifier start character in Java.
  • android developer
    android developer about 9 years
    @Tomalak It is? But isn't it used to denote an inner class? What does it mean if you put it at the beginning?
  • Tomalak
    Tomalak about 9 years
    I suppose it would take you less than five minutes to try it and find out.
  • android developer
    android developer about 9 years
    @Tomalak Well, on Android it's limited to just English letters, digits, and underscores: developer.android.com/guide/topics/manifest/… . It's been a long time since I've written pure, official Java. I've now searched for "$ character package name" on Google, and I don't think it shows the answer ...
  • Tomalak
    Tomalak about 9 years
    That page outlines conventions that apply to Android development, not rules that apply to the Java language. That's here: docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.8. It states that the $ should not be used, it does not state that it is illegal.
  • TWiStErRob
    TWiStErRob almost 8 years
    @androiddeveloper also "Java-language-style package name" is not the same thing as "JLS conformant package name".
  • TWiStErRob
    TWiStErRob almost 8 years
    VALID_JAVA_IDENTIFIER is bad choice for the name as that pattern represents a FQCN. I suggest extracting String ID_PATTERN = "\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*" to make it more obvious and readable.
  • Renaud
    Renaud almost 8 years
    @TWiStErRob not sure what you mean when you say that VALID_JAVA_IDENTIFIER represents a FQCN? Plus, not sure ID_PATTERN is more readable... Thanks for explaining.
  • TWiStErRob
    TWiStErRob almost 8 years
    A valid Java identifier can be a method name, local variable, class name, subpackage name, etc. Your "VALID_JAVA_IDENTIFIER" pattern, however, matches a fully qualified class name (FQCN) consisting of multiple identifiers (one for each subpackage + class name). An FQCN is not a valid Java identifier, because it contains dots. For ID_PATTERN see my edit on Jörgen's answer; it's easier to see what gets repeated and when, and you also don't have to scroll or break lines.
  • Ilya Kharlamov
    Ilya Kharlamov almost 3 years
    Case sensitivity matters
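
The small-parser idea raised in the comments above, calling isJavaIdentifierStart() and isJavaIdentifierPart() repeatedly, could look roughly like this. It is only a sketch: the class name FqcnScanner and the method isQualifiedName are made up, and like the regex versions it checks the shape only, not reserved words.

```java
// Hypothetical hand-written scanner built on the Character API.
final class FqcnScanner {

    static boolean isQualifiedName(String s) {
        // True at the very beginning and right after each dot.
        boolean expectStart = true;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (expectStart) {
                // A dot or a digit here means an empty or malformed part.
                if (!Character.isJavaIdentifierStart(cp)) {
                    return false;
                }
                expectStart = false;
            } else if (cp == '.') {
                expectStart = true;
            } else if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            // Step by code point so supplementary characters are handled.
            i += Character.charCount(cp);
        }
        // Empty input or a trailing dot leaves us still waiting for a start character.
        return !expectStart;
    }

    public static void main(String[] args) {
        System.out.println(isQualifiedName("a.b.C$c"));
        System.out.println(isQualifiedName("b..C"));
        // \u00A3 is the pound sign, a currency symbol, as in the comment above.
        System.out.println(isQualifiedName("pkg.\u00A3val"));
    }
}
```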