TypeError: got an unexpected keyword argument
You get an exception because UserDefinedFunction.__call__
supports only varargs and not keyword args.
    def __call__(self, *cols):
        sc = SparkContext._active_spark_context
        jc = self._judf.apply(_to_seq(sc, cols, _to_java_column))
        return Column(jc)
At a much more basic level, a UDF can receive only Column
arguments, which are expanded to their corresponding values at runtime, and not standard Python objects.
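To make the signature issue concrete without needing a Spark session, here is a minimal sketch in plain Python. `PositionalOnlyUDF` and `call_with_names` are hypothetical stand-ins invented for illustration, not part of PySpark; the only thing borrowed from the source above is that `__call__` accepts `*cols` and nothing else.

```python
class PositionalOnlyUDF:
    """Toy stand-in (not PySpark) whose __call__ takes only *cols."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *cols):
        return self.func(*cols)


def call_with_names(udf, order, **named):
    # Hypothetical helper: arrange named arguments into the positional
    # order the wrapped function expects, then call positionally.
    return udf(*(named[name] for name in order))


differs = PositionalOnlyUDF(lambda this, last: int(this != last))

# Keyword arguments are rejected by the wrapper's signature.
error = None
try:
    differs(this="a", last="b")
except TypeError as exc:
    error = str(exc)

# Passing the same values positionally works.
result = call_with_names(differs, ("this", "last"), this="a", last="b")
print(error)
print(result)
```

The same idea would apply to a real UDF: keep the names in a dict if you like, but hand the values over positionally in a fixed order.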
Personally I wouldn't use **kwargs
for this at all, but ignoring that you can achieve what you want by composing SQL expressions:
    def flag_network_timeout_(**kwargs):
        cond = (
            (kwargs['this_network'] != kwargs['last_network']) |
            (kwargs['this_campaign'] != kwargs['last_campaign']) |
            (kwargs['this_adgroup'] != kwargs['last_adgroup']) |
            (kwargs['this_creative'] != kwargs['last_creative']) |
            (kwargs['time_diff'] > network_timeout))
        return f.when(cond, 1).otherwise(0)
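Why keyword arguments work here but not with the UDF: the function above is ordinary Python that runs once and returns an expression, rather than being invoked per row. The sketch below imitates that with a toy `Expr` class that builds SQL text through operator overloading. `Expr`, `_sql`, `when_otherwise`, and `flag_timeout` are all hypothetical names for illustration and are not PySpark APIs; the condition is shortened to two terms.

```python
class Expr:
    """Toy column expression (not PySpark's Column): operators build SQL text."""

    def __init__(self, sql):
        self.sql = sql

    def __ne__(self, other):
        return Expr(f"({self.sql} <> {_sql(other)})")

    def __gt__(self, other):
        return Expr(f"({self.sql} > {_sql(other)})")

    def __or__(self, other):
        return Expr(f"({self.sql} OR {_sql(other)})")


def _sql(value):
    # Render an Expr as its SQL text and a plain Python value as a literal.
    return value.sql if isinstance(value, Expr) else repr(value)


def when_otherwise(cond, then, otherwise):
    # Hypothetical analogue of f.when(cond, then).otherwise(otherwise).
    return Expr(f"CASE WHEN {cond.sql} THEN {then} ELSE {otherwise} END")


def flag_timeout(timeout, **kwargs):
    # Nothing is compared here; the operators only assemble an expression.
    cond = (
        (kwargs["this_network"] != kwargs["last_network"])
        | (kwargs["time_diff"] > timeout)
    )
    return when_otherwise(cond, 1, 0)


expr = flag_timeout(
    30,
    this_network=Expr("this_network"),
    last_network=Expr("last_network"),
    time_diff=Expr("time_diff"),
)
print(expr.sql)
# CASE WHEN ((this_network <> last_network) OR (time_diff > 30)) THEN 1 ELSE 0 END
```

Note the use of `|` with parentheses rather than `or`: Python's `or` would try to evaluate each side as a boolean immediately, whereas `|` can be overloaded to combine expressions.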
Author: Nirmal
Updated on July 01, 2020
Comments
-
Nirmal almost 4 years
The seemingly simple code below throws the following error:
    Traceback (most recent call last):
      File "/home/nirmal/process.py", line 165, in <module>
        'time_diff': f.last(adf['time_diff']).over(window_device_rows)
    TypeError: __call__() got an unexpected keyword argument 'this_campaign'
Code:
    # Function to flag network timeouts
    def flag_network_timeout(**kwargs):
        if kwargs['this_network'] != kwargs['last_network'] \
                or kwargs['this_campaign'] != kwargs['last_campaign'] \
                or kwargs['this_adgroup'] != kwargs['last_adgroup'] \
                or kwargs['this_creative'] != kwargs['last_creative'] \
                or kwargs['time_diff'] > network_timeout:
            return 1
        else:
            return 0

    flag_network_timeout = f.udf(flag_network_timeout, IntegerType())

    # Column spec to go over the device events and flag network resets
    network_timeout_flag = flag_network_timeout(**{
        'last_network': f.first(adf['network']).over(window_device_rows),
        'last_campaign': f.first(adf['campaign']).over(window_device_rows),
        'last_adgroup': f.first(adf['adgroup']).over(window_device_rows),
        'last_creative': f.first(adf['creative']).over(window_device_rows),
        'this_network': f.last(adf['network']).over(window_device_rows),
        'this_campaign': f.last(adf['campaign']).over(window_device_rows),
        'this_adgroup': f.last(adf['adgroup']).over(window_device_rows),
        'this_creative': f.last(adf['creative']).over(window_device_rows),
        'time_diff': f.last(adf['time_diff']).over(window_device_rows)
    })

    # Update dataframe with the new columns
    adf = adf.select('*', network_timeout_flag.alias('network_timeout'))
What am I doing wrong please? Thank you.
-
Nirmal almost 8 years
"because UserDefinedFunction.__call__ supports only varargs and not keyword args". Thank you. That was it! I don't think I would have captured it myself.
-
zero323 almost 8 years
The second part is much more important IMHO :) Whenever you can choose between a UDF and composing a SQL expression, the latter should always be your first choice. UDFs, especially Python ones, have quite a few ugly properties.
-
Nirmal almost 8 years
Honestly, I didn't get the second part. How is it different from the UDF?
-
zero323 almost 8 years
Off the top of my head, a UDF: a) has to move data from off-heap storage, b) has to move data from the JVM to Python, converting types on the way, c) has to move data back to the JVM, once again using SerDe to convert types, d) introduces a nondeterministic breakpoint in the execution plan, e) doesn't benefit from codegen. The alternative solution (let's call it flag_network_timeout_) uses Python only to build the SQL expression and limits Python code to the driver.
-
Nirmal almost 8 years
This is great! I converted all my UDFs to SQL expressions and ended up with about 22% better performance (the script now completes in 34 minutes instead of 44). Thanks for your very good help!
-
muon about 7 years
Thanks for pointing out "supports only varargs and not keyword args."
-
Coliban over 3 years
I have no "kwargs" at all in my code but get the same error.