

Apache InLong is a one-stop integration framework for massive data that provides automatic, secure and reliable data transmission capabilities. InLong supports both batch and stream data processing at the same time, which offers great power to build data analysis, modeling and other real-time applications based on streaming data.

1.2.0 Features Overview

The just-released 1.2.0-incubating version closes more than 410 issues and contains 30+ features and 190+ optimizations, mainly including the following:

Enhance management and control capabilities

  • Dashboard and Manager add cluster management capabilities
  • Dashboard optimizes the flow creation process
  • Manager supports plug-in extension of MQ

Extended collection nodes

  • Support data collection from Pulsar
  • Support data collection from MongoDB-CDC
  • Support data collection from MySQL-CDC
  • Support data collection from Oracle-CDC
  • Support data collection from PostgreSQL-CDC
  • Support data collection from SQLServer-CDC

Extended write nodes

  • Support writing data to Kafka
  • Support writing data to HBase
  • Support writing data to PostgreSQL
  • Support writing data to Oracle
  • Support writing data to MySQL
  • Support writing data to TDSQL-PostgreSQL
  • Support writing data to Greenplum
  • Support writing data to SQLServer

Support data conversion

  • Support String Split
  • Support String Regular Replace
  • Support String Regular Replace First Matched Value
  • Support Data Filter
  • Support Data Distinct
  • Support Regular Join

Enhanced system monitoring capabilities

  • Support the reporting and management of data link heartbeat

Other optimizations

  • Support the delivery of DataProxy multi-cluster configurations
  • GitHub Action checks and pipeline optimization

1.2.0 Features Details

Support multi-cluster management

Manager adds a cluster management function, supports multi-cluster configuration, and removes the previous limitation that only one cluster could be defined through configuration files. Users can create different types of clusters on the Dashboard as needed.

The multi-cluster feature was mainly designed and implemented by @healchow, @luchunliang, and @leezng; thanks to these three contributors.

Enhanced collection of file data and MySQL Binlog

Version 1.2.0 supports collecting complete file data, and also supports collecting data from a specified Binlog position in MySQL. This part of the work was done by @Greedyu.

Support whole database migration

Sort supports migrating data of an entire database, contributed by @EMsnap.

Supports writing data in Canal format

Support for writing data in Canal format to Kafka, contributed by @thexiay.

Optimize the HTTP request method in Manager Client

Optimized the way HTTP requests are executed in the Manager Client and added unit tests for the Client, reducing duplicated code and maintenance costs. This feature was contributed by new contributor @leosanqing.

Supports running SQL scripts

Sort supports running SQL scripts, see INLONG-4405, thanks to @gong for contributing this feature.

Support the reporting and management of data link heartbeat

This version supports heartbeat reporting and management for data groups, data streams, and the underlying components, which is a prerequisite for the state management of each link in the system.

This feature was primarily designed and contributed by @baomingyu, @healchow and @kipshi.

Manager supports creating resources for multiple flow directions

In version 1.2.0, the Manager adds the creation of several storage resources:

  • Create Topic for Kafka (contributed by @woofyzhao)
  • Create databases and tables for Iceberg (contributed by @woofyzhao)
  • Create namespaces and tables for HBase (contributed by @woofyzhao)
  • Create databases and tables for ClickHouse (contributed by @lucaspeng12138)
  • Create indices for Elasticsearch (contributed by @lucaspeng12138)
  • Create databases and tables for PostgreSQL (contributed by @baomingyu)

Sort supports lightweight architecture

Version 1.2.0 of Sort includes a lot of refactoring and improvements. By introducing Flink CDC, it supports a variety of Extract and Load nodes, and also supports data transformation (i.e. Transform).

This feature contains many sub-features. The main developers are: @baomingyu, @EMsnap, @GanfengTan, @gong, @lucaspeng12138, @LvJiancheng, @kipshi, @thexiay, @woofyzhao, @yunqingmoswu, thank you all for your contributions.

For more information, please refer to: Analysis of InLong Sort ETL Solution.

Other features and bug fixes

For related content, please refer to the Release Notes, which details the features, enhancements and bug fixes of this release.

Apache InLong follow-up planning

In subsequent versions, we will expand more data sources and storages to cover more usage scenarios, and gradually improve the usability and robustness of the system, including:

  • Heartbeat report of each component
  • Status management of data flow
  • Full link audit support for writing to ClickHouse
  • Support more types of collection nodes and storage nodes


1. Background

With the increasing number of users and developers of Apache InLong (incubating), the demand for richer usage scenarios and low-cost operation has grown stronger. Among these demands, adding Transform (T) to the whole InLong link has received the most feedback. After research and design by community developers @yunqingmoswu, @EMsnap, @gong, and @thexiay, the InLong Sort ETL solution based on Flink SQL has been completed. This article introduces the implementation details of the solution.

The solution is based on Apache Flink SQL mainly for the following considerations:

  • Flink SQL's powerful expressiveness brings high scalability and flexibility. Flink SQL can cover most demand scenarios in the community, and when its built-in functions are not enough, they can be extended through UDFs (a sketch follows this list).
  • Compared with implementing Flink's underlying APIs directly, the development cost of Flink SQL is lower: the conversion logic to Flink SQL only needs to be implemented once, and afterwards we can focus on building Flink SQL capabilities, such as extending connectors and UDFs.
  • In general, Flink SQL is more robust and runs more stably, because Flink SQL shields many of Flink's underlying details, has strong community support, and has been validated by a large number of users.
  • For users, Flink SQL is also easier to understand, especially for those who have used SQL before; the usage is simple and familiar, which helps users adopt it quickly.
  • For migrating existing real-time tasks, if they are already SQL-type tasks, especially Flink SQL tasks, the migration cost is extremely low, and in some cases no changes are required at all.
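
When Flink SQL's built-in functions are not sufficient, a UDF can be registered and then used like any built-in function in the generated statements. The following is a minimal, illustrative Flink SQL sketch; the function name mask_phone and the Java class com.example.udf.MaskPhone are hypothetical placeholders and not part of InLong:

-- Register a hypothetical Java scalar UDF that is available on the job classpath
CREATE TEMPORARY FUNCTION mask_phone AS 'com.example.udf.MaskPhone';

-- Use it in a transform SELECT just like a built-in function
SELECT `name` AS `name`, mask_phone(`phone`) AS `phone` FROM `mysql_1`;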

Note: For all the code of this solution, please refer to Apache InLong Sort, which can be downloaded and used in the upcoming 1.2.0 release.

2. Introduction

2.1 Requirements

The main requirement of this solution is to complete the Transform (T) capability of the InLong Sort module, including:

  • Deduplication in the window: deduplicate data within a time window
  • Time window aggregation: aggregate data within a time window
  • Time format conversion: convert the value of a field to a string in the target time format
  • Field segmentation: split a field into multiple new fields by a delimiter
  • String replacement: replace some or all of the contents of a string field
  • Data filtering: discard or retain data that meets the filter conditions
  • Content extraction: extract part of a field to create a new field
  • Join: support two-table join
  • Value substitution: given a matching value, if the field's value equals it, replace it with the target value
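
Most of these transforms map directly to standard Flink SQL expressions. The sketch below is illustrative only: the table and field names (user_source, order_source, full_name, phone, age, user_id, order_id) are hypothetical, and the SQL actually generated by InLong Sort may differ in detail:

-- Field segmentation, string replacement and data filtering in one SELECT
SELECT
  SPLIT_INDEX(`full_name`, ' ', 0)      AS `first_name`,    -- field segmentation
  REGEXP_REPLACE(`phone`, '[^0-9]', '') AS `phone_digits`,  -- string replacement
  `age` AS `age`
FROM `user_source`
WHERE `age` >= 18;                                          -- data filtering

-- Deduplication and a two-table join follow the same pattern
SELECT DISTINCT `u`.`user_id`, `o`.`order_id`
FROM `user_source` AS `u`
JOIN `order_source` AS `o` ON `u`.`user_id` = `o`.`user_id`;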

2.2 Usage Scenarios

Users of big data integration have Transform requirements such as data conversion, joining, and filtering in many business scenarios.

2.3 Design Goal

This design needs to achieve the following goals:

  • Functionality: Under InLong Sort's existing architecture and data flow model, it covers basic Transform capabilities and has the ability to expand rapidly.
  • Compatibility: The new InLong Sort data model is forward compatible to ensure that historical tasks can be configured and run properly.
  • Maintainability: The conversion of the InLong Sort data model to Flink SQL only needs to be implemented once. When new functional requirements appear later, this part does not need to change, or can be supported with only a small amount of change.
  • Extensibility: When the open source Flink Connector or the built-in Flink SQL function does not meet the requirements, you can customize the Flink Connector and UDF to achieve its function expansion.

2.4 Basic Concepts

The core concepts refer to the terminology explained in the outline design.

  • InLong Dashboard: the InLong front-end management interface
  • InLong Manager Client: wraps the interfaces of the Manager for external user programs to call without going through the front-end InLong Dashboard
  • InLong Manager Openapi: the interface for calls between the InLong Manager and external systems
  • InLong Manager metaData: InLong Manager metadata management, including metadata information at the group and stream dimensions
  • InLong Manager task manager: the InLong Manager module that manages data source collection tasks, including agent task distribution, instruction distribution, and heartbeat reporting
  • InLong Group: data stream group, containing multiple data streams; one group represents one data ingestion
  • InLong Stream: data stream; a data stream has a specific flow direction
  • Stream Source: a stream has corresponding collection and sink ends; this design only involves the stream source
  • Stream Info: abstraction of a data stream in Sort, including the sources, transformations, destinations, etc. of the data stream
  • Group Info: encapsulation of data streams in Sort; one GroupInfo can contain multiple StreamInfos
  • Node: abstraction of the data source, data transformation, and data destination in data synchronization
  • Extract Node: source-side abstraction of data synchronization
  • Load Node: destination-side abstraction of data synchronization
  • MySQL Extract Node: MySQL data source abstraction
  • Kafka Load Node: Kafka data destination abstraction
  • Transform Node: abstraction of the transformation process in data synchronization
  • Aggregate Transform Node: abstraction of the aggregation-type transformation process in data synchronization
  • Node Relation: abstraction of the relationship between nodes in data synchronization
  • Field Relation: abstraction of the relationship between upstream and downstream node fields in data synchronization
  • Function: abstraction of the functions that execute the Transform (T) capability
  • Substring Function: abstraction of the string interception (substring) function
  • Filter Function: abstraction of the data filter function
  • Function Param: abstraction of the input parameters of a function
  • Constant Param: constant parameter
  • Field Info: node field
  • Meta FieldInfo: node meta-information field

2.5 Domain Model

This design mainly involves the following entities:

Group, Stream, GroupInfo, StreamInfo, Node, NodeRelation, FieldRelation, Function, FilterFunction, SubstringFunction, FunctionParam, FieldInfo, MetaFieldInfo, MySQLExtractNode, KafkaLoadNode, etc.

For ease of understanding, this section models and analyzes the relationships between these entities. The entity correspondences of the domain model are described as follows:

  • One group corresponds to one group info
  • A group contains one or more streams
  • One stream corresponds to one StreamInfo
  • A GroupInfo contains one or more StreamInfo
  • A StreamInfo contains multiple nodes
  • A StreamInfo contains one or more NodeRelations
  • A NodeRelation contains one or more FieldRelations
  • A NodeRelation contains 0 or more FilterFunctions
  • A FieldRelation contains one function or one FieldInfo as the source field and one FieldInfo as the target field
  • A function contains one or more FunctionParams

The above relationships can be represented by the following UML object relationship diagram:

sort_UML

2.6 Function Use-case Diagram

sort-usecase

3. System Outline Design

3.1 System Architecture Diagram

architecture

  • Serialization: Serialization Implementation Module
  • Deserialization: Deserialization Implementation Module
  • Flink Source: Custom Flink source implementation module
  • Flink Sink: Custom Flink sink implementation module
  • Transformation: Custom Transform implementation module
  • GroupInfo: Corresponding to Inlong group
  • StreamInfo: Corresponding to inlong stream
  • Node: Abstraction of data source, data conversion and data destination in data synchronization
  • FlinkSQLParser: SQL parser

3.2 InLong Sort Internal Operation Flow Chart

sort-operation-flow

3.3 Module Design

This design only adds Flink connector and Flink SQL generator to the original system, and modifies the data model module.

3.3.1 Module Structure

sort-module-structure

3.3.2 Module Division

Description of the important modules:

  • FlinkSQLParser: the core class used to generate Flink SQL, including a reference to GroupInfo
  • GroupInfo: Sort's internal abstraction of an InLong group, used to encapsulate the synchronization information of the entire group, including a reference to List\<StreamInfo>
  • StreamInfo: Sort's internal abstraction of an InLong stream, used to encapsulate the stream's synchronization information, including references to List\<Node> and List\<NodeRelation>
  • Node: the top-level interface of synchronization nodes; its subclass implementations are mainly used to encapsulate the data sources, destinations, and transformation nodes of the synchronization
  • ExtractNode: data extract node abstraction, inherited from Node
  • LoadNode: data load node abstraction, inherited from Node
  • TransformNode: data transformation node abstraction, inherited from Node
  • NodeRelation: defines the relationships between nodes
  • FieldRelation: defines the field relationships between nodes
  • Function: abstraction of the functions that execute the Transform (T) capability
  • FilterFunction: function abstraction for data filtering, inherited from Function
  • SubstringFunction: function abstraction for string interception, inherited from Function
  • FunctionParam: abstraction of function parameters
  • ConstantParam: encapsulation of constant function parameters, inherited from FunctionParam
  • FieldInfo: encapsulation of node fields; can also be used as a function input parameter, inherited from FunctionParam
  • MetaFieldInfo: encapsulation of built-in fields, currently mainly used in the metadata field scenario of canal-json, inherited from FieldInfo

4. Detailed System Design

The following describes the principle of SQL generation, taking synchronizing data from MySQL to Kafka as an example.

4.1 Node Described in SQL

4.1.1 ExtractNode Described in SQL

The node configuration is:

private Node buildMySQLExtractNode() {
    List<FieldInfo> fields = Arrays.asList(
            new FieldInfo("name", new StringFormatInfo()),
            new FieldInfo("age", new IntFormatInfo()));
    return new MySqlExtractNode("1", "mysql_input", fields,
            null, null, "id",
            Collections.singletonList("tableName"), "localhost", "root", "password",
            "inlong", null, null,
            null, null);
}

The generated SQL is:

CREATE TABLE `mysql_1` (`name` string,`age` int) 
with
('connector' = 'mysql-cdc-inlong',
'hostname' = 'localhost',
'username' = 'root',
'password' = 'password',
'database-name' = 'inlong',
'table-name' = 'tableName')

4.1.2 TransformNode Described in SQL

The node configuration is:

List<FilterFunction> filters = Arrays.asList(
        new SingleValueFilterFunction(EmptyOperator.getInstance(),
                new FieldInfo("age", new IntFormatInfo()),
                LessThanOperator.getInstance(), new ConstantParam(25)),
        new SingleValueFilterFunction(AndOperator.getInstance(),
                new FieldInfo("age", new IntFormatInfo()),
                MoreThanOrEqualOperator.getInstance(), new ConstantParam(18))
);

The generated SQL is:

SELECT `name` AS `name`,`age` AS `age` FROM `mysql_1` WHERE `age` < 25 AND `age` >= 18

4.1.3 LoadNode Described in SQL

The node configuration is:

private Node buildKafkaLoadNode(FilterStrategy filterStrategy) {
    List<FieldInfo> fields = Arrays.asList(
            new FieldInfo("name", new StringFormatInfo()),
            new FieldInfo("age", new IntFormatInfo())
    );
    List<FieldRelation> relations = Arrays.asList(
            new FieldRelation(new FieldInfo("name", new StringFormatInfo()),
                    new FieldInfo("name", new StringFormatInfo())),
            new FieldRelation(new FieldInfo("age", new IntFormatInfo()),
                    new FieldInfo("age", new IntFormatInfo()))
    );
    List<FilterFunction> filters = Arrays.asList(
            new SingleValueFilterFunction(EmptyOperator.getInstance(),
                    new FieldInfo("age", new IntFormatInfo()),
                    LessThanOperator.getInstance(), new ConstantParam(25)),
            new SingleValueFilterFunction(AndOperator.getInstance(),
                    new FieldInfo("age", new IntFormatInfo()),
                    MoreThanOrEqualOperator.getInstance(), new ConstantParam(18))
    );
    return new KafkaLoadNode("2", "kafka_output", fields, relations, filters,
            filterStrategy, "topic1", "localhost:9092",
            new CanalJsonFormat(), null,
            null, "id");
}

The generated SQL is:

CREATE TABLE `kafka_3` (`name` string,`age` int) 
with (
'connector' = 'kafka-inlong',
'topic' = 'topic1',
'properties.bootstrap.servers' = 'localhost:9092',
'format' = 'canal-json-inlong',
'canal-json-inlong.ignore-parse-errors' = 'true',
'canal-json-inlong.map-null-key.mode' = 'DROP',
'canal-json-inlong.encode.decimal-as-plain-number' = 'true',
'canal-json-inlong.timestamp-format.standard' = 'SQL',
'canal-json-inlong.map-null-key.literal' = 'null'
)

4.2 Field T Described in SQL

4.2.1 Filter operator

See the node configurations in section 4.1 for the relevant configuration.

The generated SQL is:

INSERT INTO `kafka_3` SELECT `name` AS `name`,`age` AS `age` FROM `mysql_1` WHERE `age` < 25 AND `age` >= 18

4.2.2 Watermark

The complete configuration of GroupInfo is as follows:

private Node buildMySqlExtractNode() {
    List<FieldInfo> fields = Arrays.asList(
            new FieldInfo("name", new StringFormatInfo()),
            new FieldInfo("age", new IntFormatInfo()),
            new FieldInfo("ts", new TimestampFormatInfo()));
    WatermarkField wk = new WatermarkField(new FieldInfo("ts", new TimestampFormatInfo()),
            new StringConstantParam("1"),
            new TimeUnitConstantParam(TimeUnit.MINUTE));
    return new MySqlExtractNode("1", "mysql_input", fields,
            wk, null, "id",
            Collections.singletonList("tableName"), "localhost", "root", "password",
            "inlong", null, null,
            null, null);
}

private Node buildKafkaNode() {
    List<FieldInfo> fields = Arrays.asList(
            new FieldInfo("name", new StringFormatInfo()),
            new FieldInfo("age", new IntFormatInfo()),
            new FieldInfo("ts", new TimestampFormatInfo()));
    List<FieldRelation> relations = Arrays.asList(
            new FieldRelation(new FieldInfo("name", new StringFormatInfo()),
                    new FieldInfo("name", new StringFormatInfo())),
            new FieldRelation(new FieldInfo("age", new IntFormatInfo()),
                    new FieldInfo("age", new IntFormatInfo()))
    );
    return new KafkaLoadNode("2", "kafka_output", fields, relations, null, null,
            "topic", "localhost:9092", new JsonFormat(),
            1, null, "id");
}

private NodeRelation buildNodeRelation(List<Node> inputs, List<Node> outputs) {
    List<String> inputIds = inputs.stream().map(Node::getId).collect(Collectors.toList());
    List<String> outputIds = outputs.stream().map(Node::getId).collect(Collectors.toList());
    return new NodeRelation(inputIds, outputIds);
}

@Override
public GroupInfo getTestObject() {
    Node input = buildMySqlExtractNode();
    Node output = buildKafkaNode();
    StreamInfo streamInfo = new StreamInfo("1", Arrays.asList(input, output), Collections.singletonList(
            buildNodeRelation(Collections.singletonList(input), Collections.singletonList(output))));
    return new GroupInfo("1", Collections.singletonList(streamInfo));
}
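
With the watermark field configured above, the DDL generated for the MySQL extract node should additionally contain a WATERMARK clause. The following is only a sketch based on the configuration above and standard Flink SQL watermark syntax; the SQL actually emitted by InLong Sort may differ in details such as type mapping:

CREATE TABLE `mysql_1` (`name` string,`age` int,`ts` timestamp(3),
WATERMARK FOR `ts` AS `ts` - INTERVAL '1' MINUTE)
with
('connector' = 'mysql-cdc-inlong',
'hostname' = 'localhost',
'username' = 'root',
'password' = 'password',
'database-name' = 'inlong',
'table-name' = 'tableName')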


Apache InLong is a one-stop integration framework for massive data that provides automatic, secure and reliable data transmission capabilities. InLong supports both batch and stream data processing at the same time, which offers great power to build data analysis, modeling and other real-time applications based on streaming data.

1.1.0 Features Overview

The just-released 1.1.0-incubating version mainly includes the following:

Enhanced management capabilities for manager

  • Added Manager Client to support the integration of InLong for creating data streams
  • Added ManagerCtl command-line tool to support viewing and creating data streams
  • Manager supports dispatching Agent tasks
  • Manager supports dispatching Sort Flink tasks
  • Manager data stream management supports start, restart, and pause operations
  • Manager plugin support
  • Manager metadata management supports using MySQL
  • The first phase of cluster management, unified cluster information registration

Rich data nodes

  • Support Iceberg
  • Support ClickHouse
  • Support MySQL Binlog collection
  • Support Kafka ingestion
  • Sort Standalone supports multiple type streams

Tools construction

  • Dashboard plugin support
  • Kubernetes deployment optimization
  • Standalone deployment refactoring

System Upgrade

  • Protocol Buffers upgrade
  • Unified version Maven version dependencies
  • Fixed a bunch of dependency CVEs
  • DataProxy supports PB compression protocol

This version closes about 642 issues, including 23 features and 180 improvements.

1.1.0 Features Details

Add Manager Client

The newly added Manager Client defines common InLong Group/Stream operation interfaces, including task creation, viewing, and deletion. Through the Manager Client, users can integrate InLong into any third-party platform to build a unified, customized platform. The Manager Client was mainly designed and implemented by @kipshi, @gong, and @ciscozhou; thanks to these three contributors.

Add ManagerCtl command line tool

ManagerCtl is developed on top of the Manager Client and is a command-line tool for manipulating InLong resources, which further simplifies developers' work. ManagerCtl was contributed independently by @haifxu, and its usage is as follows:

Usage: managerctl [options] [command] [command options]
  Options:
    -h, --help
      Get all command about managerctl.
  Commands:
    create      Create resource by json file
      Usage: create [options]

    describe    Display details of one or more resources
      Usage: describe [options]

    list        Displays main information for one or more resources
      Usage: list [options]

Manager supports issuing Sort tasks

In previous versions, after the user created a data group/stream task, the Sort task needed to be submitted to the Flink cluster manually via the command line before data sorting could begin. In version 1.1.0, we unified the start, stop, and suspend operations of Sort tasks under the Manager. Users only need to configure the correct Flink cluster when the Manager is deployed; when a data stream is approved, the Sort task will be started automatically. This part of the work was mainly contributed by @LvJiancheng.

Manager metadata no longer depends on ZooKeeper

InLong previously used ZooKeeper to save data stream metadata, which increased the deployment threshold and the operation and maintenance difficulty for developers and users. In version 1.1.0, data stream metadata is saved in the database by default, and ZooKeeper is only an optional solution for large-scale scenarios. Thanks to @kipshi and @yunqingmoswu for contributing this change.

Support MySQL Binlog collection

Version 1.1.0 fully supports the collection of data from MySQL, and supports both incremental and full strategies. Users can collect MySQL data with simple configuration in InLong. This part of the work was contributed by @EMsnap.

Sort Added Iceberg, ClickHouse, Kafka

In version 1.1.0, data node storage for more scenarios has been added, including Iceberg, ClickHouse, and Kafka. Support for these three streams enriches the usage scenarios of InLong. The new stream support was mainly contributed by @chantccc, @KevinWen007, and @healchow.

Sort Standalone supports Hive, Elasticsearch, Kafka

As mentioned in the previous release, for non-Flink environments, data streams can be sorted through Sort Standalone. In version 1.1.0, we added support for Hive, Elasticsearch, and Kafka, expanding the usage scenarios of Sort Standalone. Sort Standalone is mainly contributed by @vernedeng and @luchunliang.

Protocol Buffers upgrade

The Protocol Buffers dependencies of all InLong components have been upgraded from 2.5.0 to 3.19.4. Thanks to @gosonzhang and @doleyzi for their contributions; the Protocol Buffers upgrade required a lot of compatibility and testing work.

InLong on Kubernetes optimization

The optimization of InLong on Kubernetes mainly includes adding Audit, sorting out the configuration, optimizing the use of message queues, and improving the documentation, making it easier to use InLong on the cloud. Thanks to @shink for contributing these optimizations.

Dashboard plugin

To help users quickly build new data streams on the Dashboard, the Dashboard supports plugins in version 1.1.0. JavaScript developers who follow the plugin development guidelines can quickly add support for new data streams. Thanks to @leezng for this part of the work.

Other features and bug fixes

For related content, please refer to the version release notes, which list in detail the features, enhancements, and bug fixes of this version, as well as the specific contributors.

Apache InLong(incubating) follow-up planning

In subsequent versions, we will support lightweight Sort, and expand more data sources and targets to cover more usage scenarios, including:

  • Flink SQL support
  • Elasticsearch, HBase support


InLong is named after the sacred animal in Chinese mythology that draws rivers into the sea, a metaphor for the InLong system providing data access capabilities.

Apache InLong is a one-stop integration framework for massive data that provides automatic, secure and reliable data transmission capabilities. InLong supports both batch and stream data processing at the same time, which offers great power to build data analysis, modeling and other real-time applications based on streaming data. The 0.12.0-incubating just released mainly includes the following:

  • Provide automatic, safe, reliable and high-performance data transmission capabilities, while supporting batch and streaming
  • Refactor the document structure of the official website to facilitate users to consult related documents based on the main line
  • The whole process supports Pulsar data flow, and completes the whole process coverage of high-performance and high-reliability scenarios
  • Full-process support indicators, including JMX and Prometheus output
  • The first phase of data audit and reconciliation, write audit data to MySQL

This version closed about 120+ issues, including four major features and 35 improvements.

Apache InLong(incubating) Introduction

Apache InLong is a one-stop integration framework for massive data donated by Tencent to the Apache community. It provides automatic, safe, reliable, and high-performance data transmission capabilities to facilitate the construction of streaming-based data analysis, modeling, and applications.
The Apache InLong project was originally called TubeMQ, focusing on high-performance, low-cost message queuing services. In order to further release the surrounding ecological capabilities of TubeMQ, we upgraded the project to InLong, focusing on creating a one-stop integration framework for massive data.

Apache InLong takes TDBank, used internally at Tencent, as its prototype, and relies on trillion-level data access and processing capabilities to integrate the whole process of data collection, aggregation, storage, and sorting. It is simple to use, flexible to extend, stable, and reliable.

Apache InLong

Apache InLong serves the entire life cycle from data collection to landing, and provides different processing modules according to the different stages of data, including the following modules:

  • inlong-agent, the data collection agent, supports reading regular logs from specified directories or files and reporting data one by one. DB collection and HTTP reporting capabilities will be added in the future.
  • inlong-dataproxy, a proxy component based on Flume-ng, supports blocking of data transmission and placing data locally for retransmission, and can forward received data to different MQs (message queues).
  • inlong-tubemq, Tencent's self-developed message queuing service, focuses on high-performance storage and transmission of massive data in big data scenarios, with the advantages of extensive production practice and low cost.
  • inlong-sort consumes data from different MQ services, performs ETL processing, and then aggregates and writes the data into Apache Hive, ClickHouse, HBase, Iceberg, etc.
  • inlong-manager provides complete data service management and control capabilities, including metadata, OpenAPI, task flow, permissions, etc.
  • inlong-website, a front-end page for managing data access, simplifies the use of the entire InLong control platform.

What’s New in Apache InLong(incubating) 0.12.0

1. Support Apache Pulsar data cache

In version 0.12.0, we completed the data reporting capability of FileAgent→DataProxy→Pulsar→Sort. InLong now supports high-performance and high-reliability data access scenarios: compared with the high-throughput TubeMQ, Apache Pulsar provides better data consistency and is more suitable for scenarios with extremely high data reliability requirements, such as finance and billing.

Report via Pulsar

Thanks to @healchow, @baomingyu, @leezng, @bruceneenhl, @ifndef-SleePy and others for their contributions to this feature. For more information, please refer to [INLONG-1310](https://github.com/apache/incubator-inlong/issues/1310), and see the [Pulsar usage example](https://inlong.apache.org/zh-CN/docs/next/quick_start/pulsar_example/) to get the usage guide.

2. Support JMX and Prometheus metrics

In addition to the existing file output metrics, the various components of InLong began to gradually support the output of JMX and Prometheus metrics to improve the visibility of the entire system. Currently, modules including Agent, DataProxy, TubeMQ, Sort-Standalone, etc. already support the above metrics, and the documentation of metrics output by each module is being sorted out.

Thanks to @shink, @luchunliang, @EMsnap, @gosonzhang and others for their contributions. For related PRs, please see INLONG-1712, [INLONG-1786](https://github.com/apache/incubator-inlong/issues/1786), INLONG-1796, [INLONG-1827](https://github.com/apache/incubator-inlong/issues/1827), INLONG-1851, [INLONG-1926](https://github.com/apache/incubator-inlong/issues/1926).

3. Function extension of the modules

The Sort module adds support for Apache Doris storage and realizes the ability to load sorted data into Apache Doris, giving InLong one more storage option. In addition, in order to enrich the functions of the entire data access process, the Audit and Sort-Standalone modules have been added:

  • The Audit module provides the ability to reconcile the entire process of data flow, and monitor the service quality of the system through data flow indicators;
  • Sort-Standalone module is a non-Flink-based data sorting module. It adds lightweight data sorting capabilities to the system, facilitating the rapid construction and use of businesses.

The Audit and Sort-Standalone modules are still under development and will be released when the version is stable.

Thanks to @huzk8, @doleyzi, @luchunliang and others for their contributions; please refer to INLONG-1821, [INLONG-1738](https://github.com/apache/incubator-inlong/issues/1738), INLONG-1889.

4. Official website document directory reconstruction

In addition to the functional improvements in version 0.12.0, the structure of the official website and the usage of the documents have also been deeply adjusted, including restructuring the document directory, supplementing and improving the introduction of each module, and adding document translations. These changes further improve the readability of the InLong official website documentation and make it easy to search and read. You can check the official website for details. The documentation is still under construction, and we welcome your valuable comments and suggestions.

Thanks to @bluewang, @dockerzhang, @healchow and others for their contributions; please refer to INLONG-1711, [INLONG-1740](https://github.com/apache/incubator-inlong/issues/1740), INLONG-1802, [INLONG-1809](https://github.com/apache/incubator-inlong/issues/1809), INLONG-1810, [INLONG-1815](https://github.com/apache/incubator-inlong/issues/1815), INLONG-1822, [INLONG-1840](https://github.com/apache/incubator-inlong/issues/1840), INLONG-1856, [INLONG-1861](https://github.com/apache/incubator-inlong/issues/1861), INLONG-1867, [INLONG-1878](https://github.com/apache/incubator-inlong/issues/1878), INLONG-1901, [INLONG-1939](https://github.com/apache/incubator-inlong/issues/1939).

5. Other features and bug fixes

For related content, please refer to the Version Release Notes, which list the detailed features, improvements, and bug fixes of this version, as well as the specific contributors.

Apache InLong(incubating) follow-up planning

In subsequent versions, we will further enhance the capabilities of InLong to cover more usage scenarios, including:

  • Support the data access link to ClickHouse
  • Support DB data collection
  • The second phase of the full-link audit function


Apache InLong (incubating) was renamed from Apache TubeMQ (incubating) starting with version 0.9.0. With the name change, InLong has also been upgraded from a single message queue to a one-stop integration framework for massive data. InLong supports data collection, aggregation, caching, and sorting, so users can import data from a data source to a real-time computing engine or land it in offline storage with simple configuration. The just-released 0.11.0-incubating is the third version since the renaming, and includes the following features:

  • Further lower the threshold for users: all InLong modules can now be deployed on Kubernetes, and the official website has been refactored so that users can find related documents more easily.
  • Support more usage scenarios: add the DataProxy -> Pulsar data flow to cover scenarios with higher requirements for data integrity and consistency.
  • Support SDKs in more languages for TubeMQ: this version releases the production-level TubeMQ Go SDK, which makes it convenient for Go users to access TubeMQ.

This version closed more than 80 issues, including four significant features and 35 improvements.

Apache InLong(incubating) Introduction

Apache InLong is a one-stop integration framework for massive data donated by Tencent to the Apache community. It provides automatic, safe, reliable, and high-performance data transmission capabilities to facilitate the construction of streaming-based data analysis, modeling, and applications.
The Apache InLong project was originally called TubeMQ, focusing on high-performance, low-cost message queuing services. In order to further release the surrounding ecological capabilities of TubeMQ, we upgraded the project to InLong, focusing on creating a one-stop integration framework for massive data.

Apache InLong takes TDBank, used internally at Tencent, as its prototype, and relies on trillion-level data access and processing capabilities to integrate the whole process of data collection, aggregation, storage, and sorting. It is simple to use, flexible to extend, stable, and reliable.

Apache InLong

Apache InLong serves the entire life cycle from data collection to landing, and provides different processing modules according to the different stages of data, including the following modules:

  • inlong-agent, the data collection agent, supports reading regular logs from specified directories or files and reporting data one by one. DB collection and HTTP reporting capabilities will be added in the future.
  • inlong-dataproxy, a proxy component based on Flume-ng, supports blocking of data transmission and placing data locally for retransmission, and can forward received data to different MQs (message queues).
  • inlong-tubemq, Tencent's self-developed message queuing service, focuses on high-performance storage and transmission of massive data in big data scenarios, with the advantages of extensive production practice and low cost.
  • inlong-sort consumes data from different MQ services, performs ETL processing, and then aggregates and writes the data into Apache Hive, ClickHouse, HBase, Iceberg, etc.
  • inlong-manager provides complete data service management and control capabilities, including metadata, OpenAPI, task flow, permissions, etc.
  • inlong-website, a front-end page for managing data access, simplifies the use of the entire InLong control platform.

What’s New in Apache InLong(incubating) 0.11.0

InLong on Kubernetes

Apache InLong includes multiple components such as data collection, data aggregation, data caching, data sorting, and cluster management and control. In order to facilitate users to use and support cloud-native features, all components currently support deployment in Kubernetes. Thanks to @shink for contributing to this feature. For specific details, please refer to INLONG-1308.

Open up the DataProxy -> Pulsar data flow

Before version 0.11.0, InLong's data caching layer could only use TubeMQ. TubeMQ is well suited to scenarios with huge data volumes, but in extreme scenarios a small amount of data may be lost. To improve data reliability, InLong added support for Apache Pulsar in version 0.11.0. The InLong backend now supports the data flow agent -> dataproxy -> TubeMQ/Pulsar -> sort. The introduction of Pulsar enriches the application scenarios covered by InLong and meets the needs of more users. Thanks to @baomingyu for his contribution to this feature. For more details, please refer to INLONG-1330.

Go SDK for InLong TubeMQ

Before version 0.11.0, InLong TubeMQ provided SDKs in three languages: Java, C++, and Python. With more and more applications written in Go, the community's demand for a Go SDK has also been increasing. Version 0.11.0 officially introduces the Go SDK for TubeMQ, which enriches the multilingual SDKs and lowers the access barrier for Go users. Thanks to @TszKitLo40 for contributing this feature. For more details, please refer to:

Refactor the official website

In version 0.11.0, InLong refactored the official website with Docusaurus to provide more concise and intuitive document management and display. Thanks to @leezng for his contribution to this feature. For more details, please refer to:

In addition to the above major features, InLong 0.11.0 version has other key improvements, including but not limited to:

The 0.11.0 version also fixes about 45 bugs. Thanks to all the contributors who have contributed to the InLong community (in no particular order):

  • shink
  • baomingyu
  • TszKitLo40
  • leezng
  • ruanwenjun
  • leo65535
  • luchunliang
  • pierre94
  • EMsnap
  • dockerzhang
  • gosonzhang
  • healchow
  • guangxuCheng
  • yuanboliu

Apache InLong(incubating) follow-up planning

In the subsequent version planning of InLong, we will further release the capabilities of InLong to cover more usage scenarios, mainly including:

  • Support Apache Pulsar full link data access capabilities, including back-end processing and front-end management capabilities.
  • Support data flow such as ClickHouse, Apache Iceberg, Apache HBase, etc.