TypeError: got an unexpected keyword argument

You get an exception because UserDefinedFunction.__call__ supports only varargs and not keyword args.

def __call__(self, *cols):
    sc = SparkContext._active_spark_context
    jc = self._judf.apply(_to_seq(sc, cols, _to_java_column))
    return Column(jc)
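
The failure can be reproduced without Spark. The following is a minimal sketch in plain Python: `FakeUserDefinedFunction` and `add` are hypothetical stand-ins invented for illustration, not PySpark classes, and they only mirror the `*cols` signature shown above.

```python
class FakeUserDefinedFunction:
    """Hypothetical stand-in for illustration only; not PySpark's class."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *cols):
        # Only positional arguments are accepted, mirroring the signature above.
        return self.func(*cols)


def add(a, b):
    return a + b


fake_udf = FakeUserDefinedFunction(add)

# Positional call: accepted by *cols.
positional_result = fake_udf(1, 2)

# Keyword call: rejected before the wrapped function is ever reached.
try:
    fake_udf(**{'a': 1, 'b': 2})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(positional_result)
print(error_message)
```

The wrapped function never sees the keyword arguments, because Python rejects them when binding the call to `__call__(self, *cols)`.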

At a more basic level, a UDF can receive only Column arguments, which are resolved to their corresponding values at runtime; it cannot receive standard Python objects.

Personally I wouldn't use **kwargs for this at all, but setting that aside, you can achieve what you want by composing SQL expressions:

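# Assumes `from pyspark.sql import functions as f` and a `network_timeout`
# threshold defined elsewhere, as in the question's script.
def flag_network_timeout_(**kwargs):
    # Each value is a Column, so this builds a boolean Column expression.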
    cond = (
        (kwargs['this_network'] != kwargs['last_network']) |
        (kwargs['this_campaign'] != kwargs['last_campaign']) |
        (kwargs['this_adgroup'] != kwargs['last_adgroup']) |
        (kwargs['this_creative'] != kwargs['last_creative']) |
        (kwargs['time_diff'] > network_timeout))

    return f.when(cond, 1).otherwise(0)
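
If you would rather keep a UDF and a keyword-style call site, one possible workaround is to fix an argument order and unpack the keywords positionally. This is a sketch only: `ARG_ORDER`, `call_positionally`, and the `network_timeout` value of 1800 are assumptions made up for illustration, and a plain Python function stands in for the UDF so the example runs without Spark.

```python
# Hypothetical fixed argument order shared by the function and its callers.
ARG_ORDER = (
    'this_network', 'last_network',
    'this_campaign', 'last_campaign',
    'this_adgroup', 'last_adgroup',
    'this_creative', 'last_creative',
    'time_diff',
)

network_timeout = 1800  # assumed threshold; the question's script defines its own


def flag_network_timeout(*args):
    """Return 1 if any attribution field changed or the time gap is too large."""
    values = dict(zip(ARG_ORDER, args))
    changed = any(
        values['this_' + field] != values['last_' + field]
        for field in ('network', 'campaign', 'adgroup', 'creative')
    )
    return 1 if changed or values['time_diff'] > network_timeout else 0


def call_positionally(func, **kwargs):
    """Unpack keyword arguments into the positional order given by ARG_ORDER."""
    missing = [name for name in ARG_ORDER if name not in kwargs]
    if missing:
        raise KeyError('missing arguments: ' + ', '.join(missing))
    return func(*(kwargs[name] for name in ARG_ORDER))


same = {name: 'x' for name in ARG_ORDER if name != 'time_diff'}
flag_unchanged = call_positionally(flag_network_timeout, time_diff=10, **same)
flag_changed = call_positionally(
    flag_network_timeout, time_diff=10, **dict(same, this_campaign='y')
)
print(flag_unchanged, flag_changed)
```

With PySpark, `func` would be the function wrapped by `f.udf(...)` and the keyword values would be Column expressions. The performance drawbacks of Python UDFs still apply, so composing SQL expressions remains preferable.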
Author by

Nirmal

Updated on July 01, 2020

Comments

  • Nirmal
    Nirmal almost 4 years

    The seemingly simple code below throws the following error:

    Traceback (most recent call last):
      File "/home/nirmal/process.py", line 165, in <module>
        'time_diff': f.last(adf['time_diff']).over(window_device_rows)
    TypeError: __call__() got an unexpected keyword argument 'this_campaign'
    

    Code:

    # Function to flag network timeouts
    def flag_network_timeout(**kwargs):
        if kwargs['this_network'] != kwargs['last_network'] \
                or kwargs['this_campaign'] != kwargs['last_campaign'] \
                or kwargs['this_adgroup'] != kwargs['last_adgroup'] \
                or kwargs['this_creative'] != kwargs['last_creative'] \
                or kwargs['time_diff'] > network_timeout:
            return 1
        else:
            return 0
    flag_network_timeout = f.udf(flag_network_timeout, IntegerType())
    
    # Column spec to go over the device events and flag network resets
    network_timeout_flag = flag_network_timeout(**{
        'last_network': f.first(adf['network']).over(window_device_rows),
        'last_campaign': f.first(adf['campaign']).over(window_device_rows),
        'last_adgroup': f.first(adf['adgroup']).over(window_device_rows),
        'last_creative': f.first(adf['creative']).over(window_device_rows),
        'this_network': f.last(adf['network']).over(window_device_rows),
        'this_campaign': f.last(adf['campaign']).over(window_device_rows),
        'this_adgroup': f.last(adf['adgroup']).over(window_device_rows),
        'this_creative': f.last(adf['creative']).over(window_device_rows),
        'time_diff': f.last(adf['time_diff']).over(window_device_rows)
    })
    
    # Update dataframe with the new columns
    adf = adf.select('*', network_timeout_flag.alias('network_timeout'))
    

    What am I doing wrong please? Thank you.

  • Nirmal
    Nirmal almost 8 years
    "because UserDefinedFunction.__call__ supports only varargs and not keyword args". Thank you. That was it! I don't think I would have captured it myself.
  • zero323
    zero323 almost 8 years
    The second part is much more important IMHO :) Whenever you can choose between a UDF and composing SQL expressions, the latter should always be your first choice. UDFs, especially Python ones, have quite a few ugly properties.
  • Nirmal
    Nirmal almost 8 years
    Honestly, I didn't get the second part. How is it different from the UDF?
  • zero323
    zero323 almost 8 years
    Off the top of my head, a UDF: a) has to move data from off-heap storage, b) has to move data from the JVM to Python, converting types on the way, c) has to move data back to the JVM, once again using SerDe to convert types, d) introduces a nondeterministic breakpoint in the execution plan, and e) doesn't benefit from codegen. The alternative solution (let's call it flag_network_timeout_) uses Python only to build the SQL expression and limits Python code to the driver.
  • Nirmal
    Nirmal almost 8 years
    This is great! I converted all my UDFs to SQL expressions and ended up with about 22% better performance (the script's run time dropped from 44 minutes to 34 minutes). Thanks for your very good help!
  • muon
    muon about 7 years
    Thanks for pointing out that it supports only varargs and not keyword args.
  • Coliban
    Coliban over 3 years
    I have no "kwargs" at all in my code but get the same error.