Iceberg Basics

Revision as of 13:04, 24 December 2024 (Tue)

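Iceberg table format versions (V1 and V2):

  1. V1 defines how to manage large analytic tables using immutable file formats (Parquet, ORC, Avro). It covers metadata files, properties, data types, the table schema, partition information, and how tables are written and read.
  2. V2 builds on V1 by adding row-level updates and deletes. The main change is the introduction of delete files, which record the rows to be deleted, so rows can be updated or deleted without rewriting the existing data files.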

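In general:

  1. V1 tables have no delete files, so they are best suited to append-only workloads such as event-tracking tables.
  2. V2 tables support row-level updates, so they suit data whose state changes over time, such as user tables and order tables.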
TBLPROPERTIES (
  'format-version'='2'
)
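
The fragment above shows only the property itself. As a minimal sketch of where it fits, assuming Hive 4 with Iceberg support enabled, the statements below create a V2 table and then change it row by row. The tables orders_ice and events_ice_v1 are hypothetical and are not part of the original test.

```sql
-- Hypothetical example table; the name and columns are illustrative only.
CREATE TABLE orders_ice (
  order_id BIGINT,
  status   STRING
)
STORED BY ICEBERG
TBLPROPERTIES (
  'format-version'='2'
);

INSERT INTO orders_ice VALUES (1, 'created'), (2, 'created');

-- On a V2 table these changes can be recorded in delete files (merge-on-read)
-- instead of rewriting the existing data files.
UPDATE orders_ice SET status = 'paid' WHERE order_id = 1;
DELETE FROM orders_ice WHERE order_id = 2;

-- Assumption: events_ice_v1 is an existing V1 Iceberg table.
-- Setting the same property is the usual way to upgrade it to V2.
ALTER TABLE events_ice_v1 SET TBLPROPERTIES ('format-version'='2');
```

Whether UPDATE and DELETE are available depends on the Hive and Iceberg versions in use, so treat this as a starting point to verify on your own cluster.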

hive -> iceberg

hive

-- Fields are separated by spaces; strings that contain spaces are wrapped in double quotes
CREATE TABLE test1 (
  col1 STRING,
  col2 INT,
  col3 STRING,
  col4 STRING,
  col5 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = " ",
    "quoteChar" = "\"",
    "serialization.encoding"="UTF-8"
)
STORED AS TEXTFILE;

-- Upload the file to HDFS first (shell): hdfs dfs -put test1.csv /tmp/data/
LOAD DATA INPATH '/tmp/data/test1.csv' OVERWRITE INTO TABLE test1;

select col4, count(*) cs from test1 group by col4 limit 10; 
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     30         30        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 99.62 s    
----------------------------------------------------------------------------------------------
INFO  : Completed executing command(queryId=hdfs_20241224111257_4f4363b0-d489-4bac-99d8-86451fa0a45c); Time taken: 99.725 seconds
+----------------+-----------+
|      col4      |    cs     |
+----------------+-----------+
| xeron x5 3708  | 40075712  |
+----------------+-----------+
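
One caveat about the test1 table: OpenCSVSerde reads every column as STRING, whatever type the DDL declares, so col2 is effectively a string even though it is declared INT. The sketch below checks the reported types and casts explicitly where a number is needed. The alias col2_int is illustrative.

```sql
-- Show the column types Hive actually reports for the table.
DESCRIBE test1;

-- Cast explicitly where a number is required; values that cannot be parsed become NULL.
SELECT col1, CAST(col2 AS INT) AS col2_int
FROM test1
LIMIT 5;
```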

iceberg

-- The partition column must not also appear in the column list
CREATE TABLE test_ice (
  col1 STRING,
  col2 INT,
  col4 STRING,
  col5 STRING
)
PARTITIONED BY (col3 STRING)
STORED BY ICEBERG;

insert into test_ice(col1, col2, col3, col4, col5) select col1, col2, col3, col4, col5 from test1;

select col4, count(*) cs from test_ice group by col4 limit 10; 
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      4          4        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 22.68 s    
----------------------------------------------------------------------------------------------
INFO  : Completed executing command(queryId=hdfs_20241224111143_d436171c-f9fe-4e9a-87e1-0e2ea9b48b1b); Time taken: 26.945 seconds