Running:
hdfs dfs -du -h /user/hive/warehouse/customers/partition_dt=2025-03-27
returns:
1.0G 3.0G /user/hive/warehouse/customers/partition_dt=2025-03-27
To compact this, I tried:
1. Copying the data to another partition (partition_dt=2025-03-27+1000 days)
2. Using INSERT OVERWRITE to reload it back into partition_dt=2025-03-27
However, this approach fails when the partition has too many files, and some files are so large that the compaction does not complete successfully.
How can I efficiently merge small files in this partition without causing failures?
Solution 1: Use Hive Major Compaction (For ACID Tables)
ALTER TABLE customers PARTITION (partition_dt='2025-03-27') COMPACT 'MAJOR';
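Note that this only applies if customers is a transactional (ACID) table. The compaction request is queued and runs asynchronously in the metastore, so you can check on its progress; a minimal check, assuming the compactor threads are enabled:
SHOW COMPACTIONS;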
Solution 2: Use INSERT OVERWRITE with Merge Settings (For Non-ACID Tables)
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- SELECT * includes the partition_dt column, so use a dynamic partition spec to map it back onto the partition
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers WHERE partition_dt='2025-03-27' DISTRIBUTE BY partition_dt;
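hive.merge.mapfiles only merges the output of map-only jobs. Depending on the execution engine, you may also want the corresponding standard Hive flags for reduce-side and Tez output; a sketch:
SET hive.merge.mapredfiles=true;  -- merge files produced by map-reduce jobs
SET hive.merge.tezfiles=true;     -- merge files when running on Tez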
Solution 3: Use CREATE TABLE AS SELECT (CTAS) and Reload the Partition
CREATE TABLE customers_tmp STORED AS ORC AS SELECT * FROM customers WHERE partition_dt='2025-03-27';
ALTER TABLE customers DROP PARTITION (partition_dt='2025-03-27');
-- customers_tmp still contains the partition_dt column, so use a dynamic partition spec (requires the dynamic partition settings shown in Solution 2)
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers_tmp;
DROP TABLE customers_tmp;
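Optionally, refresh the partition's statistics after the rewrite so the optimizer picks up the new file layout:
ANALYZE TABLE customers PARTITION (partition_dt='2025-03-27') COMPUTE STATISTICS;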
Solution 4: Use HDFS distcp to Merge Small Files
hadoop fs -mkdir /tmp/customers_merged
hadoop distcp -Ddfs.replication=1 -blocksperchunk 32 \
  /user/hive/warehouse/customers/partition_dt=2025-03-27 /tmp/customers_merged
Then, load the copied data back into Hive. Because the target directory already exists, distcp places the source directory underneath it, so the path to load includes the partition directory:
LOAD DATA INPATH '/tmp/customers_merged/partition_dt=2025-03-27' OVERWRITE INTO TABLE customers PARTITION (partition_dt='2025-03-27');
Solution 5: Use Spark to Read and Write Larger Files
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveCompaction").enableHiveSupport().getOrCreate()

# Read the over-fragmented partition and rewrite it as ~10 larger files in a staging table
df = spark.sql("SELECT * FROM customers WHERE partition_dt='2025-03-27'")
df.coalesce(10).write.mode("overwrite").format("parquet").saveAsTable("customers_compacted")
Then, overwrite the partition in Hive:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- customers_compacted keeps partition_dt as its last column, so a dynamic partition insert maps it back onto the target partition
INSERT OVERWRITE TABLE customers PARTITION (partition_dt) SELECT * FROM customers_compacted;
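Afterwards you can drop the staging table, mirroring the cleanup step in Solution 3:
DROP TABLE customers_compacted;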
Which method should I use for large-scale Hive table compaction?