'RDD' object has no attribute '_jdf' pyspark RDD


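You shouldn't use an RDD with CountVectorizer. The pyspark.ml estimators expect a DataFrame: fit() reads the dataset's _jdf attribute, which an RDD does not have, hence the error. Instead, build the array of words in the DataFrame itself: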

train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
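# Split each line on spaces; the first token is the label.
# Note: "words" still contains the label as its first element here.
td = train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))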

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

After that, you can call the transform function on the fitted model:

vectorizer_transformer.transform(td).show(truncate=False)

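If you prefer to keep the RDD-based approach, a few lines of your code need to change: split line[0] rather than the Row itself, convert the mapped RDD back to a DataFrame with toDF(), and fit on that DataFrame instead of the RDD. Here is the complete, working version of your code: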

from pyspark import Row
from pyspark.sql.session import SparkSession

# Create (or reuse) the session; its SparkContext is available as an attribute.
spark = SparkSession.builder.appName("ML").getOrCreate()
sc = spark.sparkContext  # the original assigned the SparkContext class itself, not an instance

train_data = spark.read.text("20ng-train-all-terms.txt")
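td = train_data.rdd  # convert the DataFrame to an RDD of Row objects
# Each element is a Row with a single "value" field, so split line[0], then convert back to a DataFrame.
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()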
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
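
The reason the RDD version above indexes line[0] is that each element of train_data.rdd is a Row (a tuple subclass), not a plain string. The following is only a minimal sketch of that parsing step in plain Python, with no Spark involved: TextRow is a hypothetical namedtuple standing in for pyspark's Row, and parse_line is an illustrative helper that is not part of the original code.

```python
from collections import namedtuple

# Hypothetical stand-in for the one-column rows produced by spark.read.text().
# Assumption: a namedtuple is close enough to pyspark's Row (also a tuple
# subclass) to illustrate the indexing; it is not the real class.
TextRow = namedtuple("TextRow", ["value"])


def parse_line(row):
    """Split one text row into a (label, words) pair."""
    tokens = row[0].split(" ")  # row[0] is the string; the row itself has no split()
    return tokens[0], tokens[1:]


row = TextRow("alt.atheism alt atheism faq atheist resources")

# Calling split() on the row itself fails because a tuple has no such method.
try:
    row.split()
except AttributeError as exc:
    print("row.split() failed:", exc)

label, words = parse_line(row)
print(label)
print(words)
```

In the Spark code the same split happens inside the two map() calls, and because RDD transformations are lazy, a mistake in the lambda only surfaces once an action (such as toDF() inferring the schema, or fit()) actually evaluates it.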

That said, I would suggest sticking with the DataFrame approach.

Author by

A.Dorra

Updated on June 04, 2022

Comments

  • A.Dorra
    A.Dorra over 1 year

    I'm new to PySpark. I would like to perform some machine learning on a text file.

    from pyspark import Row
    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    from pyspark import SparkConf
    sc = SparkContext
    spark = SparkSession.builder.appName("ML").getOrCreate()
    
    train_data = spark.read.text("20ng-train-all-terms.txt")
    td= train_data.rdd #transformer df to rdd
    tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
    from pyspark.ml.feature import CountVectorizer
    
    vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
    vectorizer_transformer = vectorizer.fit(td)
    

    For my last command, I get the error "AttributeError: 'RDD' object has no attribute '_jdf'".

    [screenshot of the resulting error]

    Can anyone help me, please? Thank you.

    • Arpit Solanki
      Arpit Solanki almost 6 years
      Please post the complete traceback of the error.
    • A.Dorra
      A.Dorra almost 6 years
      I posted a screenshot of the resulting error. Thank you.
    • desertnaut
      desertnaut almost 6 years
      CountVectorizer of pyspark.ml works on DataFrames, not on RDDs (see examples and docs).
    • A.Dorra
      A.Dorra almost 6 years
      My input file is a text file without any structure.
    • A.Dorra
      A.Dorra almost 6 years
      Here is an example of my text file "alt.atheism alt atheism faq atheist resources archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and assorted other atheist paraphernalia are available from the freedom from religion foundation in the us write to ffrf p o box madison "