'RDD' object has no attribute '_jdf' pyspark RDD
You shouldn't be using an rdd with CountVectorizer. Instead, build the array of words in the dataframe itself:
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
# split each line on spaces; the first token is the label
td = train_data.select(F.split("value", " ").alias("words")) \
               .select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
Then it should work, and you can call the transform function:
vectorizer_transformer.transform(td).show(truncate=False)
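Conceptually, CountVectorizer builds a vocabulary over all rows during fit and maps each row's word list to per-term counts during transform. A minimal pure-Python sketch of that bag-of-words idea (no Spark needed; the sample lines are hypothetical but follow the same "label word word ..." format as the input file):

```python
from collections import Counter

# Hypothetical sample lines in the same "label word word ..." format
lines = [
    "alt.atheism atheism faq atheist resources",
    "alt.atheism atheist organizations usa",
]

# Split each line: the first token is the label, the rest are the words
rows = [{"label": toks[0], "words": toks[1:]}
        for toks in (line.split(" ") for line in lines)]

# "fit": build a vocabulary over all rows
vocabulary = sorted({w for r in rows for w in r["words"]})

# "transform": count term occurrences per row
bags = [Counter(r["words"]) for r in rows]

print(vocabulary)
print([dict(b) for b in bags])
```

CountVectorizer does the same thing distributed over the dataframe, storing the counts as a sparse vector in the bag_of_words column.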
Now, if you want to stick with the old style of converting to an rdd, you have to modify a few lines. The following is the corrected, working version of your code:
from pyspark.sql import Row
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the dataframe to an rdd
# line[0] is the row's string value; split it, build Rows, then go back to a dataframe
tr_data = td.map(lambda line: line[0].split(" ")) \
            .map(lambda words: Row(label=words[0], words=words[1:])) \
            .toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)  # fit on the dataframe, not the rdd
But I would suggest you stick with the dataframe way.
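The error message itself comes from the fact that pyspark.ml estimators expect a DataFrame: internally they reach for the dataset's _jdf attribute (the handle to the underlying JVM DataFrame), which an RDD simply doesn't have. A toy, Spark-free sketch of that failure mode (FakeDataFrame and FakeEstimator are hypothetical names, not real PySpark classes):

```python
class FakeDataFrame:
    def __init__(self, jdf):
        self._jdf = jdf  # stands in for the JVM DataFrame handle

class FakeEstimator:
    def fit(self, dataset):
        # like pyspark.ml estimators, assume a DataFrame and touch _jdf
        return dataset._jdf

est = FakeEstimator()
est.fit(FakeDataFrame("jvm-ref"))  # works: the object has _jdf

try:
    est.fit(["just", "an", "rdd-like", "list"])  # no _jdf attribute
except AttributeError as e:
    print(e)
```

That is why vectorizer.fit(td) fails when td is an RDD but succeeds once the data is back in a dataframe.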
Author: A.Dorra
Updated on June 04, 2022

Comments
- A.Dorra, over 1 year ago: I'm new to pyspark. I would like to perform some machine learning on a text file.

from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd #transformer df to rdd
tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)

And for my last command, I obtain the error "AttributeError: 'RDD' object has no attribute '_jdf'". Can anyone help me please? Thank you.
- Arpit Solanki, almost 6 years ago: Post the complete traceback of the error.
- A.Dorra, almost 6 years ago: I posted a screenshot of the resulting error. Thank you.
- desertnaut, almost 6 years ago
- A.Dorra, almost 6 years ago: My input file is a text without any structure.
- A.Dorra, almost 6 years ago: Here is an example of my text file: "alt.atheism alt atheism faq atheist resources archive name atheism resources alt atheism archive name resources last modified december version atheist resources addresses of atheist organizations usa freedom from religion foundation darwin fish bumper stickers and assorted other atheist paraphernalia are available from the freedom from religion foundation in the us write to ffrf p o box madison "