Version: Next

HDFS

Overview

The HDFS Load Node uses the general capabilities of Flink's file system connector to support single files and partitioned files. The file system connector is included in Flink and does not require an additional dependency; the corresponding jar can be found in the /lib directory of the Flink distribution. A format must be specified for reading rows from and writing rows to a file system.

How to create a HDFS Load Node

Usage for SQL API

The example below shows how to create a HDFS Load Node with the Flink SQL CLI:

CREATE TABLE hdfs_load_node (
  id STRING,
  name STRING,
  uv BIGINT,
  pv BIGINT,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector'='filesystem-inlong',
  'path'='...',
  'format'='orc',
  'sink.partition-commit.delay'='1 h',
  'sink.partition-commit.policy.kind'='success-file'
);

File Formats

CSV (uncompressed)
JSON (the JSON format for the file system connector is not a typical JSON file but uncompressed, newline-delimited JSON)
Avro (supports compression by configuring avro.codec)
Parquet (compatible with Hive)
ORC (compatible with Hive)
Debezium-JSON
Canal-JSON
Raw

Rolling Policy

Data within the partition directories is split into part files. Each partition contains at least one part file for each subtask of the sink that has received data for that partition. The in-progress part file is closed and a new part file is created according to the configurable rolling policy. The policy rolls part files based on their size and on a timeout that specifies the maximum duration for which a file can stay open.

Option	Default	Type	Description
sink.rolling-policy.file-size	128MB	MemorySize	The maximum part file size before rolling.
sink.rolling-policy.rollover-interval	30 min	String	The maximum time a part file can stay open before rolling (30 min by default, to avoid too many small files). How often this is checked is controlled by the `sink.rolling-policy.check-interval` option.
sink.rolling-policy.check-interval	1 min	String	The interval for checking time-based rolling policies. This controls how often the sink checks whether a part file should roll over based on `sink.rolling-policy.rollover-interval`.
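
As a minimal sketch, the rolling policy options above can be set in the WITH clause of the Load Node definition. The table name `hdfs_rolling_example` and its columns are hypothetical, the path is left elided as in the example earlier on this page, and the values simply restate the documented defaults, so adjust them to your workload.

```sql
-- Hypothetical table; only the three sink.rolling-policy.* options come from the table above.
CREATE TABLE hdfs_rolling_example (
  id STRING,
  name STRING
) WITH (
  'connector'='filesystem-inlong',
  'path'='...',                                      -- elided: set to your HDFS directory
  'format'='csv',
  'sink.rolling-policy.file-size'='128MB',           -- roll once a part file reaches this size
  'sink.rolling-policy.rollover-interval'='30 min',  -- roll once a part file has been open this long
  'sink.rolling-policy.check-interval'='1 min'       -- how often the time-based condition is checked
);
```

Because the time-based condition is only evaluated once per check interval, a part file may stay open somewhat longer than the rollover interval.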

File Compaction

The file sink supports file compaction, which allows applications to use smaller checkpoint intervals without generating a large number of files.

Option	Default	Type	Description
auto-compaction	false	Boolean	Whether to enable automatic compaction in streaming sink or not. The data will be written to temporary files. After the checkpoint is completed, the temporary files generated by a checkpoint will be compacted. The temporary files are invisible before compaction.
compaction.file-size	(none)	String	The target file size for compaction. Defaults to the rolling file size.
inlong.metric.labels	(none)	String	InLong metric label. The value format is groupId=[groupId]&streamId=[streamId]&nodeId=[nodeId].
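
The options above could be combined as in the following sketch. The table name, columns, and the label values `my_group`, `my_stream`, and `my_node` are hypothetical placeholders, and the path is left elided. Since compaction runs after a checkpoint completes, this presumably only has an effect when checkpointing is enabled for the job.

```sql
-- Hypothetical table illustrating the compaction and metric label options.
CREATE TABLE hdfs_compaction_example (
  id STRING,
  name STRING
) WITH (
  'connector'='filesystem-inlong',
  'path'='...',                      -- elided: set to your HDFS directory
  'format'='orc',
  'auto-compaction'='true',          -- compact the temporary files written during a checkpoint
  'compaction.file-size'='128MB',    -- target size; falls back to the rolling file size if unset
  'inlong.metric.labels'='groupId=my_group&streamId=my_stream&nodeId=my_node'  -- placeholder IDs
);
```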

Partition Commit

After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or by writing a _SUCCESS file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of triggers and policies.

Option	Default	Type	Description
sink.partition-commit.trigger	process-time	String	Trigger type for partition commit. `process-time`: based on the system time of the machine; it requires neither partition time extraction nor watermark generation. The partition is committed once the current system time passes the partition creation system time plus `delay`. `partition-time`: based on the time extracted from the partition values; it requires watermark generation. The partition is committed once the watermark passes the time extracted from the partition values plus `delay`.
sink.partition-commit.delay	0 s	Duration	The partition will not be committed until the delay has elapsed. For a daily partition the value should be `1 d`; for an hourly partition it should be `1 h`.
sink.partition-commit.watermark-time-zone	UTC	String	The time zone used to parse the long watermark value into a TIMESTAMP value. The parsed watermark timestamp is compared with the partition time to decide whether the partition should be committed. This option only takes effect when `sink.partition-commit.trigger` is set to `partition-time`. If this option is not configured correctly, e.g. the source rowtime is defined on a TIMESTAMP_LTZ column but this option is not set, users may see the partition committed a few hours late. The default value is `UTC`, which means the watermark is defined on a TIMESTAMP column or is not defined. If the watermark is defined on a TIMESTAMP_LTZ column, the time zone of the watermark is the session time zone. The option value is either a full name such as `America/Los_Angeles` or a custom time zone ID such as `GMT-08:00`.
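
The following sketch shows how the `partition-time` trigger might be configured for an hourly partitioned table. It rests on two assumptions that this page does not confirm: the upstream source defines a watermark, and the connector accepts `partition.time-extractor.timestamp-pattern`, an option of Flink's file system connector that is not listed in the table above. The table name and the `Asia/Shanghai` time zone are illustrative only.

```sql
-- Hypothetical table; assumes a watermark is defined upstream on a TIMESTAMP_LTZ column.
CREATE TABLE hdfs_partition_time_example (
  id STRING,
  name STRING,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector'='filesystem-inlong',
  'path'='...',                                                    -- elided: set to your HDFS directory
  'format'='orc',
  'partition.time-extractor.timestamp-pattern'='$dt $hour:00:00',  -- assumption: Flink option, not listed on this page
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='1 h',                             -- hourly partitions: commit one hour after partition time
  'sink.partition-commit.watermark-time-zone'='Asia/Shanghai',     -- illustrative session time zone
  'sink.partition-commit.policy.kind'='success-file'
);
```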

Partition Commit Policy

The partition commit policy defines the specific action taken when a partition is committed.

metastore: This policy is only supported for Hive tables.
success-file: The _SUCCESS file is written to the partition directory after the part files are generated.

Option	Required	Default	Type	Description
sink.partition-commit.policy.kind	optional	(none)	String	Policy for committing a partition, i.e. notifying downstream applications that the partition has finished writing and is ready to be read. `metastore`: add the partition to the metastore. Only Hive tables support the metastore policy; the file system manages partitions through the directory structure. `success-file`: add a `_SUCCESS` file to the directory. `custom`: use a policy class to create the commit policy. Multiple policies can be configured at the same time, e.g. `metastore,success-file`.
sink.partition-commit.policy.class	optional	(none)	String	The partition commit policy class, which must implement the PartitionCommitPolicy interface. Only takes effect with the `custom` commit policy.
sink.partition-commit.success-file.name	optional	_SUCCESS	String	The file name for success-file partition commit policy, default is `_SUCCESS`.
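
As an illustration of the policy options above, the sketch below commits daily partitions with the `success-file` policy and a non-default marker file name. The table name, columns, and the file name `_DONE` are hypothetical, and the path is left elided.

```sql
-- Hypothetical table using the success-file policy with a custom marker file name.
CREATE TABLE hdfs_commit_policy_example (
  id STRING,
  name STRING,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector'='filesystem-inlong',
  'path'='...',                                        -- elided: set to your HDFS directory
  'format'='orc',
  'sink.partition-commit.delay'='1 d',                 -- daily partitions
  'sink.partition-commit.policy.kind'='success-file',
  'sink.partition-commit.success-file.name'='_DONE'    -- hypothetical name; the default is _SUCCESS
);
```

Downstream jobs that poll for the marker file would then need to look for `_DONE` rather than `_SUCCESS`.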
