'RDD' object has no attribute '_jdf' pyspark RDD

python-3.x apache-spark machine-learning pyspark spark-dataframe

16,591

You shouldn't use an RDD with CountVectorizer: it is a pyspark.ml estimator, and its fit method expects a DataFrame. Instead, build the array of words in the DataFrame itself:

train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
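# Split each line on spaces; the first token is the label (note that "words" still contains it at index 0)
td = train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))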

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

After that, you can call transform on the fitted model:

vectorizer_transformer.transform(td).show(truncate=False)
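
To make the bag_of_words column less abstract, here is a plain-Python sketch of the counting idea behind CountVectorizer. It is not Spark's implementation: the helper names build_vocabulary and count_vector are made up for illustration, the alphabetical tie-breaking is an assumption, and Spark stores the result as a sparse vector rather than a list.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct term to an index, most frequent terms first."""
    totals = Counter(term for doc in documents for term in doc)
    # Sort by descending corpus frequency; ties are broken alphabetically (an assumption)
    ordered = sorted(totals, key=lambda term: (-totals[term], term))
    return {term: index for index, term in enumerate(ordered)}

def count_vector(doc, vocabulary):
    """Return the per-term counts for one document as a dense list."""
    counts = Counter(doc)
    vector = [0] * len(vocabulary)
    for term, n in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = n
    return vector

# Two tiny hypothetical documents, already split into words
docs = [["alt", "atheism", "faq", "atheism"], ["atheism", "resources"]]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]
```

Each position in a vector corresponds to one vocabulary term, and the value is how often that term occurs in the document.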

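If you would rather keep the RDD-based approach, a few lines have to change: each element of the RDD is a Row, so the text must be read with line[0] before splitting; the mapped RDD must be converted back with toDF(); and fit must receive that DataFrame rather than the RDD. Here is the complete corrected version of your code: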

from pyspark import Row
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName("ML").getOrCreate()
sc = spark.sparkContext  # the SparkContext instance owned by the session (the original assigned the class itself)

train_data = spark.read.text("20ng-train-all-terms.txt")
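td = train_data.rdd  # convert the DataFrame to an RDD of Row objects
# line[0] is the text of the row; the first token is the label, the rest are the words
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()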
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)

That said, I would suggest sticking with the DataFrame approach.
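
For reference, the per-line parsing that the RDD version performs can be sketched in plain Python. In the Spark code each element of train_data.rdd is a Row with a single value field, which is why the text is read with line[0]; the sketch below starts from the plain string instead. The function name parse_line and the shortened sample line are illustrative only.

```python
def parse_line(line):
    """Split one raw line into (label, words), mirroring the RDD map step."""
    tokens = line.split(" ")
    # The first token is the newsgroup label; everything after it is the document text
    return tokens[0], tokens[1:]

# Shortened version of the sample line from the question
sample = "alt.atheism alt atheism faq atheist resources"
label, words = parse_line(sample)
```

Here label is the newsgroup name and words no longer contains it, which matches words[1:] in the RDD code.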

A.Dorra

Updated on June 04, 2022

Comments

A.Dorra almost 2 years
I'm new to PySpark. I would like to perform some machine learning on a text file.
```
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td= train_data.rdd #transformer df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
```
For my last command, I get the error "AttributeError: 'RDD' object has no attribute '_jdf'".

[Screenshot of the error]

Can anyone help me, please? Thank you.
- Arpit Solanki about 6 years
  
  Please post the complete traceback of the error.
- A.Dorra about 6 years
  
  I posted a screenshot of the resulting error. Thank you.
- desertnaut about 6 years
  
  CountVectorizer from pyspark.ml works on DataFrames, not on RDDs (see the examples and docs).
- A.Dorra about 6 years
  
  My input file is a text file without any structure.
- A.Dorra about 6 years
  
  Here is an example of my text file: "alt.atheism alt atheism faq atheist resources archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and assorted other atheist paraphernalia are available from the freedom from religion foundation in the us write to ffrf p o box madison "