Iceberg 基础:修订间差异

来自牛奶河Wiki
跳到导航 跳到搜索
无编辑摘要
第72行: 第72行:
INFO  : Completed executing command(queryId=hdfs_20241224111143_d436171c-f9fe-4e9a-87e1-0e2ea9b48b1b); Time taken: 26.945 seconds</nowiki></small>
INFO  : Completed executing command(queryId=hdfs_20241224111143_d436171c-f9fe-4e9a-87e1-0e2ea9b48b1b); Time taken: 26.945 seconds</nowiki></small>


==== 资源 ====
==== 运行时 ====
  <small><nowiki># vmstat 1
  <small><nowiki># vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----

2024年12月24日 (二) 14:08的版本

Iceberg 表版本(V1和V2):

  1. V1 定义了如何使用不可变类型的文件(Parquet、ORC、AVRO)来管理大型分析型的表,包括元数据文件、属性、数据类型、表的模式,分区信息,以及如何写入与读取
  2. V2 在 V1 的基础上增加了如何通过这些类型的表实现行级别的更新与删除功能。其最主要的改变是引入了 delete file 记录需要删除的行数据,这样可以在不重写原有(数据)文件的前提下,实现行数据的更新与删除

一般来说:

  1. V1 表只支持增量数据插入,适合做纯增量写入场景,如埋点数据表
  2. V2 表支持行级更新,适合做状态变化的更新,如用户表、订单表
TBLPROPERTIES (
  'format-version'='2', 
)

hive -> iceberg

hive

# 分隔符为空格,字符串中有空格用双引号引起来
CREATE TABLE test1 (
  col1 STRING,
  col2 INT,
  col3 STRING,
  col4 STRING,
  col5 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = " ",
    "quoteChar" = "\"",
    "serialization.encoding"="UTF-8"
)
STORED AS TEXTFILE;

# hdfs dfs -put test1.csv /tmp/data/
LOAD DATA INPATH '/tmp/data/test1.csv' OVERWRITE INTO TABLE test1;

select col4, count(*) cs from test1 group by col4 limit 10; 
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     30         30        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 99.62 s    
----------------------------------------------------------------------------------------------
INFO  : Completed executing command(queryId=hdfs_20241224111257_4f4363b0-d489-4bac-99d8-86451fa0a45c); Time taken: 99.725 seconds
+----------------+-----------+
|      col4      |    cs     |
+----------------+-----------+
| xeron x5 3708  | 40075712  |
+----------------+-----------+

iceberg

# 分区字段不能出现在建表字段中
CREATE TABLE test_ice (
  col1 STRING,
  col2 INT,
  col4 STRING,
  col5 STRING
)
PARTITIONED BY (col3 STRING)
STORED by iceberg;

insert into test_ice(col1, col2, col3, col4, col5) select col1, col2, col3, col4, col5 from test1;

select col4, count(*) cs from test_ice group by col4 limit 10; 
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      4          4        0        0       0       0  
Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 22.68 s    
----------------------------------------------------------------------------------------------
INFO  : Completed executing command(queryId=hdfs_20241224111143_d436171c-f9fe-4e9a-87e1-0e2ea9b48b1b); Time taken: 26.945 seconds

运行时

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
24  2      0 4735832   1816 6360404    0    0     3    25    4   23  2  1 97  0  0
31  0      0 4675836   1816 6360596    0    0     0    41 2591 3733 85 15  0  0  0
44  0      0 4615600   1816 6360656    0    0     0    80 2943 5270 88 11  1  0  0
65  0      0 4531204   1816 6360548    0    0     0    24 3148 4973 86 14  1  0  0
38  0      0 4477960   1816 6360580    0    0     0    28 2751 6194 88 12  0  0  0
...
 1  3      0 3117460   1708 7576584    0    0     0     0 2368 3685  1  1 49 49  0
 0  3      0 3117144   1708 7576596    0    0     0 16384 2550 3815  1  1 50 49  0
...
 0  0      0 3062816   1708 8845520    0    0     0   108 4263 7792  3  1 95  1  0
 0  0      0 3062804   1708 8845684    0    0     0   100 4112 8209  2  1 97  1  0

# top
Tasks: 259 total,   2 running, 257 sleeping,   0 stopped,   0 zombie
%Cpu(s): 92.3 us,  6.7 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
KiB Mem : 16266524 total,   273132 free,  9388948 used,  6604444 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6373620 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                        
23006 hdfs      20   0 2536576 551248  25184 S  57.8  3.4   0:28.10 java
22892 hdfs      20   0 2507524 491888  25188 S  55.8  3.0   0:28.05 java
23020 hdfs      20   0 2534372 636328  25168 S  55.1  3.9   0:23.80 java
22887 hdfs      20   0 2532756 646060  25176 S  46.8  4.0   0:27.88 java
22895 hdfs      20   0 2537776 662796  25184 S  40.9  4.1   0:30.90 java
22959 hdfs      20   0 2506228 502380  25180 S  40.5  3.1   0:28.43 java
22897 hdfs      20   0 2507996 486764  25188 S  39.5  3.0   0:28.39 java
23004 hdfs      20   0 2504672 652876  25176 S  35.5  4.0   0:30.46 java