Running:
hdfs dfs -du -h /user/hive/warehouse/customers/partition_dt=2025-03-27
returns:
1.0G 3.0G /user/hive/warehouse/customers/partition_dt=2025-03-27
To compact this, I tried:
1. Copying the data to another partition (partition_dt=2025-03-27+1000 days)
2. Using INSERT OVERWRITE to reload it back into partition_dt=2025-03-27
However, this approach fails when the partition has too many files, and some files are so large that the compaction does not complete successfully.
How can I efficiently merge small files in this partition without causing failures?
Solution 1: Use Hive Major Compaction (For ACID Tables)
ALTER TABLE customers PARTITION (partition_dt='2025-03-27') COMPACT 'MAJOR';
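Note that this only applies if customers is a transactional (ACID) table. The compaction request is queued and runs asynchronously in the metastore, so you can check on its progress; a minimal check, assuming the compactor threads are enabled:
SHOW COMPACTIONS;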
Solution 2: Use INSERT OVERWRITE with Merge Settings (For Non-ACID Tables)
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- SELECT * includes the partition_dt column, so use a dynamic partition spec to map it back onto the partition
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers WHERE partition_dt='2025-03-27' DISTRIBUTE BY partition_dt;
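hive.merge.mapfiles only merges the output of map-only jobs. Depending on the execution engine, you may also want the corresponding standard Hive flags for reduce-side and Tez output; a sketch:
SET hive.merge.mapredfiles=true;  -- merge files produced by map-reduce jobs
SET hive.merge.tezfiles=true;     -- merge files when running on Tez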
Solution 3: Use CREATE TABLE AS SELECT (CTAS) and Reload the Partition
CREATE TABLE customers_tmp STORED AS ORC AS SELECT * FROM customers WHERE partition_dt='2025-03-27';
ALTER TABLE customers DROP PARTITION (partition_dt='2025-03-27');
-- customers_tmp still contains the partition_dt column, so use a dynamic partition spec (requires the dynamic partition settings shown in Solution 2)
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers_tmp;
DROP TABLE customers_tmp;
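Optionally, refresh the partition's statistics after the rewrite so the optimizer picks up the new file layout:
ANALYZE TABLE customers PARTITION (partition_dt='2025-03-27') COMPUTE STATISTICS;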
Solution 4: Use HDFS distcp to Merge Small Files
hadoop fs -mkdir /tmp/customers_merged
hadoop distcp -Ddfs.replication=1 -blocksperchunk 32 \
  /user/hive/warehouse/customers/partition_dt=2025-03-27 /tmp/customers_merged
Then, load the copied data back into Hive. Because the target directory already exists, distcp places the source directory underneath it, so the path to load includes the partition directory:
LOAD DATA INPATH '/tmp/customers_merged/partition_dt=2025-03-27' OVERWRITE INTO TABLE customers PARTITION (partition_dt='2025-03-27');
Solution 5: Use Spark to Read and Write Larger Files
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveCompaction").enableHiveSupport().getOrCreate()

# Read the over-fragmented partition and rewrite it as ~10 larger files in a staging table
df = spark.sql("SELECT * FROM customers WHERE partition_dt='2025-03-27'")
df.coalesce(10).write.mode("overwrite").format("parquet").saveAsTable("customers_compacted")
Then, overwrite the partition in Hive:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- customers_compacted keeps partition_dt as its last column, so a dynamic partition insert maps it back onto the target partition
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers_compacted;
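Afterwards you can drop the staging table, mirroring the cleanup step in Solution 3:
DROP TABLE customers_compacted;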
Which method should I use for large-scale Hive table compaction?