ModuleNotFoundError: No module named 'pyarrow'
Solution 1
I had the same problem on AWS EMR. I tried the other answers, but they didn't work. The only solution that worked for me was installing pyarrow with pip instead of conda:

pip install pyarrow

I really don't know why, but it may be useful if you have the same problem.
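As a hedged sketch (my addition, not part of the original answer): on a cluster, pyarrow must be importable by the exact interpreter the Spark executors run, so it is worth confirming the install from that interpreter. The environment path below is taken from the question's traceback; adjust it to your setup.

# activate the same environment the Spark workers use, then install
source activate PySparkEnv
pip install pyarrow
# confirm the workers' interpreter can actually import it
/home/shekhar/.conda/envs/PySparkEnv/bin/python -c "import pyarrow; print(pyarrow.__version__)"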
Solution 2
Delete your __pycache__ directories. I had exactly the same problem, and this solved it for me.
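In case it helps, a one-liner sketch (my addition, assuming a Unix-like shell) that removes every stale __pycache__ directory under the current project, so old bytecode cannot shadow a freshly installed module:

# recursively delete all __pycache__ directories below the current directory
find . -type d -name "__pycache__" -exec rm -rf {} +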
Author: spartacus
Updated on July 18, 2021

Comments
- spartacus (almost 3 years ago):
I am trying to run a simple pandas UDF example on my server, taken from here. I have created a fresh environment just for the purpose of running this code.
(PySparkEnv) $ conda list
# packages in environment at /home/shekhar/.conda/envs/PySparkEnv:
#
# Name                    Version                   Build    Channel
arrow-cpp                 0.10.0           py36h70250a7_0    conda-forge
blas                      1.0                         mkl
boost-cpp                 1.67.0               h3a22d5f_0    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.8.24            ha4d7672_0    conda-forge
certifi                   2018.8.24                py36_1    conda-forge
icu                       58.2                 hfc679d8_0    conda-forge
intel-openmp              2019.0                      117
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
mkl                       2019.0                      117
mkl_fft                   1.0.6                    py36_0    conda-forge
mkl_random                1.0.1                    py36_0    conda-forge
ncurses                   6.1                  hfc679d8_1    conda-forge
numpy                     1.15.0           py36h1b885b7_0
numpy-base                1.15.0           py36h3dfced4_0
openssl                   1.0.2p               h470a237_0    conda-forge
pandas                    0.23.4           py36hf8a1672_0    conda-forge
parquet-cpp               1.5.0.pre            h83d4a3d_0    conda-forge
pip                       18.0                     py36_1    conda-forge
py4j                      0.10.7                     py_1    conda-forge
pyarrow                   0.10.0           py36hfc679d8_0    conda-forge
pyspark                   2.3.1                    py36_1    conda-forge
python                    3.6.6                h5001a0f_0    conda-forge
python-dateutil           2.7.3                      py_0    conda-forge
pytz                      2018.5                     py_0    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
setuptools                40.2.0                   py36_0    conda-forge
six                       1.11.0                   py36_1    conda-forge
sqlite                    3.24.0               h2f33b56_1    conda-forge
tk                        8.6.8                         0    conda-forge
wheel                     0.31.1                   py36_1    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge
Then I run the following code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.types import *
from pyspark.sql.functions import col, pandas_udf, PandasUDFType
import pandas as pd
import os

os.environ['PYSPARK_PYTHON'] = '/usr/local/anaconda3/bin/python3'

SparkContext.setSystemProperty('spark.executor.memory', '30g')
SparkContext.setSystemProperty('spark.executor.cores', '5')

spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()

# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# Create a Spark DataFrame, 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()
I am getting the following error, which I have been unable to find any help on.
ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 127, in read_single_udf
    return arg_offsets, wrap_scalar_pandas_udf(row_func, return_type)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 79, in wrap_scalar_pandas_udf
    arrow_return_type = to_arrow_type(return_type)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1613, in to_arrow_type
    import pyarrow as pa
ModuleNotFoundError: No module named 'pyarrow'

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:171)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-09-13 11:55:39 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): [the same PythonException and stack as above, repeated verbatim]
2018-09-13 11:55:39 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 350, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: [the same Python traceback and executor stack as above]

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: [the same Python traceback and executor stack as above]
	... 1 more
More importantly, this works on my local machine. I will be grateful for any help I can get on this; I have been stuck for a couple of days now.
- Atul Dwivedi (about 5 years ago): If you have any questions, ask them in the comments. Try to avoid asking questions inside your answer or offering a solution based on a guess.
- spartacus (about 5 years ago): I don't remember how I solved it, but maybe there should be a warning message highlighting that an IDE version is not installed.
- Steinhafen (about 5 years ago): For the IDE issue there is no warning message; it would be nice to have one, though. If the IDE is activated only in the general settings and not from the conda environment, some packages have problems, such as pyarrow and tqdm.
- Chique_Code (over 4 years ago): Hi, I think this would solve my problem as well. Would you mind telling me more about the core node, and how I would install pyarrow on that core node? Pandas works in my Jupyter notebook, but pyarrow keeps giving me the same error: "ModuleNotFoundError: No module named 'pyarrow'". I ran pip3 install pyarrow and even tried brew install apache-arrow; the requirements are already satisfied, so I'm somewhat lost. I need to import this library in my Jupyter notebook.
- Bitswazsky (over 4 years ago): A bootstrap script is most likely going to solve the issue you're facing. Please check my answer here: https://stackoverflow.com/a/57408712/4245859
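For readers who cannot follow the link, a minimal bootstrap-action sketch along those lines (hypothetical, not the linked answer verbatim): a shell script uploaded to S3 and registered as an EMR bootstrap action runs on every node at cluster startup, so the executors' Python can import pyarrow.

#!/bin/bash
# install_pyarrow.sh -- hypothetical EMR bootstrap action; runs on each node
# (master and core) when the cluster starts, before any Spark job runs.
set -e
sudo python3 -m pip install pyarrow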
- Ivan (almost 2 years ago): This works for me, and I don't understand why. I was running my application in a virtual environment using Anaconda, but after pip install pyarrow, everything works fine.