ModuleNotFoundError: No module named 'pyarrow'


Solution 1

I had the same problem on AWS EMR. I tried the other answers, but they didn't work.

The only solution that worked for me was installing pyarrow with pip instead of conda:

pip install pyarrow

I don't really know why, but it may help if you have the same problem.
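
If you want to be sure pip installs pyarrow for the exact interpreter your Spark workers use, here is a minimal sketch; the interpreter path is taken from the question and is only an assumption about your setup:

    import subprocess

    # Assumed path: the interpreter the Spark workers use
    # (the value of PYSPARK_PYTHON in the question).
    worker_python = "/usr/local/anaconda3/bin/python3"

    # Install pyarrow for that exact interpreter, which has the same effect as
    # running `pip install pyarrow` from the environment the workers actually use.
    subprocess.check_call([worker_python, "-m", "pip", "install", "pyarrow"])

    # Verify the import now succeeds for the worker interpreter.
    subprocess.check_call([worker_python, "-c", "import pyarrow; print(pyarrow.__version__)"])

Running pip through the worker interpreter avoids the common case where the pip on your PATH belongs to a different environment than the one the Spark executors use.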

Solution 2

Delete your __pycache__ files.

I had exactly the same problem and this solved it for me.
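
If the caches are scattered across a project, a small sketch like the following removes them all; the root path is a placeholder you would point at your project or site-packages directory:

    import pathlib
    import shutil

    # Placeholder root: the project (or site-packages) directory to clean.
    root = pathlib.Path(".")

    # Collect all __pycache__ directories first, then remove each one.
    for cache_dir in list(root.rglob("__pycache__")):
        shutil.rmtree(cache_dir)
        print(f"removed {cache_dir}")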


Comments

  • spartacus
    spartacus almost 3 years

    I am trying to run a simple pandas UDF example on my server, following the example in the Spark documentation.

    I have created a fresh environment just for the purpose of running this code.

    (PySparkEnv) $ conda list
    # packages in environment at /home/shekhar/.conda/envs/PySparkEnv:
    #
    # Name                    Version                   Build  Channel
    arrow-cpp                 0.10.0           py36h70250a7_0    conda-forge
    blas                      1.0                         mkl  
    boost-cpp                 1.67.0               h3a22d5f_0    conda-forge
    bzip2                     1.0.6                h470a237_2    conda-forge
    ca-certificates           2018.8.24            ha4d7672_0    conda-forge
    certifi                   2018.8.24                py36_1    conda-forge
    icu                       58.2                 hfc679d8_0    conda-forge
    intel-openmp              2019.0                      117  
    libffi                    3.2.1                hfc679d8_5    conda-forge
    libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
    libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
    libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
    mkl                       2019.0                      117  
    mkl_fft                   1.0.6                    py36_0    conda-forge
    mkl_random                1.0.1                    py36_0    conda-forge
    ncurses                   6.1                  hfc679d8_1    conda-forge
    numpy                     1.15.0           py36h1b885b7_0  
    numpy-base                1.15.0           py36h3dfced4_0  
    openssl                   1.0.2p               h470a237_0    conda-forge
    pandas                    0.23.4           py36hf8a1672_0    conda-forge
    parquet-cpp               1.5.0.pre            h83d4a3d_0    conda-forge
    pip                       18.0                     py36_1    conda-forge
    py4j                      0.10.7                     py_1    conda-forge
    pyarrow                   0.10.0           py36hfc679d8_0    conda-forge
    pyspark                   2.3.1                    py36_1    conda-forge
    python                    3.6.6                h5001a0f_0    conda-forge
    python-dateutil           2.7.3                      py_0    conda-forge
    pytz                      2018.5                     py_0    conda-forge
    readline                  7.0                  haf1bffa_1    conda-forge
    setuptools                40.2.0                   py36_0    conda-forge
    six                       1.11.0                   py36_1    conda-forge
    sqlite                    3.24.0               h2f33b56_1    conda-forge
    tk                        8.6.8                         0    conda-forge
    wheel                     0.31.1                   py36_1    conda-forge
    xz                        5.2.4                h470a237_1    conda-forge
    zlib                      1.2.11               h470a237_3    conda-forge
    

    Then I run the following code:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    
    from pyspark.sql.dataframe import DataFrame
    from pyspark.sql.types import *
    from pyspark.sql.functions import col, pandas_udf, PandasUDFType
    import pandas as pd
    import os
    os.environ['PYSPARK_PYTHON'] = '/usr/local/anaconda3/bin/python3'
    SparkContext.setSystemProperty('spark.executor.memory', '30g')
    SparkContext.setSystemProperty('spark.executor.cores', '5')
    spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
    
    # Declare the function and create the UDF
    def multiply_func(a, b):
        return a * b
    
    multiply = pandas_udf(multiply_func, returnType=LongType())
    
    # The function for a pandas_udf should be able to execute with local Pandas data
    x = pd.Series([1, 2, 3])
    print(multiply_func(x, x))
    # 0    1
    # 1    4
    # 2    9
    # dtype: int64
    
    # Create a Spark DataFrame, 'spark' is an existing SparkSession
    df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    
    # Execute function as a Spark vectorized UDF
    df.select(multiply(col("x"), col("x"))).show()
    

    I am getting the following error which I am unable to find help on.

    ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 127, in read_single_udf
        return arg_offsets, wrap_scalar_pandas_udf(row_func, return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 79, in wrap_scalar_pandas_udf
        arrow_return_type = to_arrow_type(return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1613, in to_arrow_type
        import pyarrow as pa
    ModuleNotFoundError: No module named 'pyarrow'
    
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:171)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    2018-09-13 11:55:39 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 127, in read_single_udf
        return arg_offsets, wrap_scalar_pandas_udf(row_func, return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 79, in wrap_scalar_pandas_udf
        arrow_return_type = to_arrow_type(return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1613, in to_arrow_type
        import pyarrow as pa
    ModuleNotFoundError: No module named 'pyarrow'
    
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:171)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    
    2018-09-13 11:55:39 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 350, in show
        print(self._jdf.showString(n, 20, vertical))
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
        format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o58.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 127, in read_single_udf
        return arg_offsets, wrap_scalar_pandas_udf(row_func, return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 79, in wrap_scalar_pandas_udf
        arrow_return_type = to_arrow_type(return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1613, in to_arrow_type
        import pyarrow as pa
    ModuleNotFoundError: No module named 'pyarrow'
    
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:171)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 127, in read_single_udf
        return arg_offsets, wrap_scalar_pandas_udf(row_func, return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 79, in wrap_scalar_pandas_udf
        arrow_return_type = to_arrow_type(return_type)
      File "/home/shekhar/.conda/envs/PySparkEnv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1613, in to_arrow_type
        import pyarrow as pa
    ModuleNotFoundError: No module named 'pyarrow'
    
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:171)
        at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
        at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
        at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:800)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
    

    More importantly, this works on my local machine. I will be grateful for any help on this; I have been stuck for a couple of days now.

  • Atul Dwivedi
    Atul Dwivedi about 5 years
    If you have any questions, please ask them in the comments. Try to avoid asking many questions in your answer or giving a solution based on a guess.
  • spartacus
    spartacus about 5 years
    I do not remember how I solved it, but maybe there should be a warning message highlighting that the package is not installed for the interpreter the IDE is using.
  • Steinhafen
    Steinhafen about 5 years
    For the IDE issue there is no warning message; it would be nice to have one, though. If the interpreter is only activated in the IDE's general settings but not from the conda environment, some packages have problems, such as pyarrow and tqdm.
  • Chique_Code
    Chique_Code over 4 years
    Hi, I think this would solve my problem as well. Would you mind telling me more about the core node, and how I would install pyarrow on that core node? Pandas works in my Jupyter notebook, but pyarrow gives me the same error every time: "ModuleNotFoundError: No module named 'pyarrow'". I ran pip3 install pyarrow and even tried it with brew (brew install apache-arrow); the requirements are already satisfied, so I'm a bit lost. I need to import this library in my Jupyter notebook.
  • Bitswazsky
    Bitswazsky over 4 years
    A bootstrap script is most likely going to solve the issue you're facing. Please check my answer here: https://stackoverflow.com/a/57408712/4245859
  • Ivan
    Ivan almost 2 years
    This works for me, though I do not understand why. I was running my application in a virtual environment using Anaconda, but after pip install pyarrow, everything works fine.