Using regular expressions in Spark

Your regex only matches a string made of one lowercase letter followed by one uppercase letter, e.g. aA, bA, rF. So it shouldn't discard any of the elements of your list.

So you may want to change it to this:

[a-zA-Z]*

This matches any word composed only of letters (both lowercase and uppercase).
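To make the difference between the two patterns concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name `RegexMatchDemo`, the helper `fullMatch` and the sample words are illustrative assumptions, not part of the original question.

```scala
import scala.util.matching.Regex

object RegexMatchDemo {
  // Pattern from the question: exactly one lowercase letter, then one uppercase letter.
  val original: Regex = "[a-z][A-Z]".r
  // Suggested pattern: zero or more ASCII letters.
  val lettersOnly: Regex = "[a-zA-Z]*".r

  // True only when the WHOLE string matches the regex (java.util.regex.Matcher#matches).
  def fullMatch(regex: Regex, s: String): Boolean =
    regex.pattern.matcher(s).matches

  def main(args: Array[String]): Unit = {
    // Hypothetical sample words, chosen to exercise both patterns.
    val words = List("aA", "hello", "12", "Hi5", "")
    words.foreach { w =>
      println(s"'$w' original=${fullMatch(original, w)} lettersOnly=${fullMatch(lettersOnly, w)}")
    }
  }
}
```

Note that because of the `*` quantifier, `[a-zA-Z]*` also matches the empty string; `[a-zA-Z]+` would be the choice if empty items should not count as words.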

As for the type mismatch, you are using the wrong method: startsWith expects a String, not a Regex. To test a string against a regex, use something like this:

val pattern = """[a-zA-Z]*""".r

val filtered = rdd.filter(line => !pattern.pattern.matcher(line).matches)

And here is the output:

scala> filtered.foreach(println)
12
13
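The same predicate can be tried without a cluster on an ordinary Scala collection, since `filter` on a `Seq` takes the same kind of function as `rdd.filter`. The sketch below also includes a variant for whole lines, which seems closer to what the question asks (keep lines that contain a number). The names `FilterDemo`, `keepNonAlphabetic` and `keepLinesWithDigit` are hypothetical, and the digit-based check is an assumption about the intended rule rather than the answer's own method.

```scala
object FilterDemo {
  private val lettersOnly = "[a-zA-Z]*".r
  private val digit = "[0-9]".r

  // Same predicate as in the rdd.filter call: keep items that are not purely alphabetic.
  def keepNonAlphabetic(items: Seq[String]): Seq[String] =
    items.filter(item => !lettersOnly.pattern.matcher(item).matches)

  // Variant for whole lines: keep a line if a digit occurs anywhere in it.
  def keepLinesWithDigit(lines: Seq[String]): Seq[String] =
    lines.filter(line => digit.findFirstIn(line).isDefined)

  def main(args: Array[String]): Unit = {
    // Words from the first sample line in the question.
    val words = "12 13 hello hiiii hhhhh".split(" ").toSeq
    keepNonAlphabetic(words).foreach(println)

    // Two of the sample lines from the question.
    val lines = Seq(
      "this doesnt have numeric so should be removed",
      "`12` this line contains numeric shouldn't exclude"
    )
    keepLinesWithDigit(lines).foreach(println)
  }
}
```

With an RDD of lines, the second predicate could be passed to `rdd.filter` in the same way; `findFirstIn` searches anywhere in the string, whereas `matches` requires the whole string to match.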

You can check the API documentation for scala.util.matching.Regex for more details.

Author by

Abhishek Allamsetty

Updated on June 04, 2022

Comments

  • Abhishek Allamsetty
    Abhishek Allamsetty almost 2 years

    I am having trouble using regular expressions. My sample data is:

    12 13 hello hiiii hhhhh
    
    this doesnt have numeric so should be removed
    Even this line should be excluded
    `12` this line contains numeric shouldn't exclude
    Hope even this line should be excluded
    
    scala> val pattern = "[a-z][A-Z]".r
    pattern: scala.util.matching.Regex = [a-z][A-Z]

    scala> val b = a.filter(line => !line.startsWith(pattern))
    <console>:31: error: type mismatch;
     found   : scala.util.matching.Regex
     required: String
           val b = a.filter(line => !line.startsWith(pattern))
                                                     ^

    or if I use

    scala> val b = a.filter(line => !line.startsWith("[a-z][A-Z]".r)).take(3)
    <console>:29: error: type mismatch;
     found   : scala.util.matching.Regex
     required: String
           val b = a.filter(line => !line.startsWith("[a-z][A-Z]".r)).take(3)
                                                                 ^
    

    I am not sure exactly how to use a regex in Spark. Please help me out.

  • Arsinux
    Arsinux about 6 years
    This answer works!