Spark SQL "Limit"

hadoop apache-spark hive hortonworks-data-platform

13,818

Sampling can be used in either of the following ways:

select .... from my_table TABLESAMPLE(.3 PERCENT)

or

select .... from my_table TABLESAMPLE(30000000 ROWS)
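As a rough sketch of how the percentage could be chosen: for the figures in the question (300 million out of 10 billion rows) the fraction works out to 3 percent. The helper names `sample_percent` and `build_sample_query` are hypothetical, and `"*"` stands in for the column list that the post elides. A percentage sample returns approximately the target number of rows, not exactly that number.

```python
def sample_percent(target_rows, total_rows):
    """Return the percentage of the table expected to yield about target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 <= target_rows <= total_rows:
        raise ValueError("target_rows must be between 0 and total_rows")
    return 100.0 * target_rows / total_rows


def build_sample_query(columns, table, percent):
    """Build a TABLESAMPLE query string; `columns` is a placeholder for the real column list."""
    # {:g} drops a trailing ".0", so 3.0 is rendered as "3".
    return "select {} from {} TABLESAMPLE({:g} PERCENT)".format(columns, table, percent)


# Figures taken from the question: 300 million rows wanted out of 10 billion.
percent = sample_percent(300_000_000, 10_000_000_000)
query = build_sample_query("*", "my_table", percent)
print(query)
```

The resulting string could then be passed to `sqlContext.sql(...)` in place of the `limit` query.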

Authored by

David H

Updated on June 19, 2022

Comments

David H almost 2 years
Environment: Spark 1.6 on Hadoop (Hortonworks Data Platform 2.5).

I have a table with 10 billion records and I would like to get 300 million records and move them to a temporary table.
```
sqlContext.sql("select .... from my_table limit 300000000").repartition(50)
  .write.saveAsTable("temporary_table")
```
I saw that the LIMIT keyword actually makes Spark use only one executor. This means moving 300 million records to one node and writing them back to Hadoop. How can I avoid this reduce step but still get just 300 million records while using more than one executor? I would like all nodes to write to Hadoop.

Can sampling help me with that? If so, how?
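The difference between the two approaches can be illustrated with a toy, pure-Python model (this is not Spark code). It assumes that a percentage sample keeps each row independently with a fixed probability and is applied to every partition separately, so no rows need to be gathered in one place. The function names, the 50 partitions, and the row counts are all illustrative assumptions.

```python
import random


def sample_partition(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def sample_table(partitions, fraction, seed=0):
    """Sample every partition on its own; no rows move between partitions."""
    return [
        sample_partition(rows, fraction, seed + index)
        for index, rows in enumerate(partitions)
    ]


# Toy table: 50 partitions of 10,000 rows each (500,000 rows in total).
partitions = [list(range(i * 10_000, (i + 1) * 10_000)) for i in range(50)]

sampled = sample_table(partitions, 0.03)
total = sum(len(rows) for rows in sampled)

# The total is close to 3% of 500,000 (about 15,000), but not exactly that number.
print(len(sampled), total)
```

In this model the output still has 50 partitions, each of which could be written independently, at the cost of an approximate rather than exact row count.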
