Skip to main content

Release 1.13.0

· 8 min read

Apache InLong recently released version 1.13.0, which closed about 275+ issues, including 6+ major features and 100+ optimizations. The main features include supporting SSH installation of agents, managing field templates, support configuring offline synchronization tasks, and collecting PostgreSQL through agents . After the release of 1.13.0, Apache InLong has enriched and optimized Agent function scenarios, enhanced the accuracy of Audit data measurement, and enriched the capabilities and applicable scenarios of Sort, solved the demand for quick troubleshooting in development and operation, and optimized the user experience of Apache InLong operation and maintenance.

About Apache InLong

As the industry's first one-stop, full-scenario, open-source massive data integration framework, Apache InLong provides automatic, safe, reliable, and high-performance data transmission capabilities to facilitate businesses to build stream-based data analysis, modeling, and applications quickly. At present, InLong is widely used in various industries such as advertising, payment, social networking, games, artificial intelligence, etc., serving thousands of businesses, among which the scale of high-performance scene data exceeds 1 trillion lines per day, and the scale of high-reliability scene data exceeds 10 trillion lines per day.

The core keywords of InLong project positioning are "one-stop" and "massive data". For "one-stop", we hope to shield technical details, provide complete data integration and support services, and implement out-of-the-box; With its advantages, such as multi-cluster management, it can stably support larger-scale data volumes based on trillions of lines per day.

1.13.0 Version Overview

Apache InLong recently released version 1.13.0, which closed about 275+ issues, including 6+ major features and 100+ optimizations. The main features include supporting SSH installation of agents, managing field templates, support configuring offline synchronization tasks, and collecting PostgreSQL through agents . After the release of 1.13.0, Apache InLong has enriched and optimized Agent function scenarios, enhanced the accuracy of Audit data measurement, and enriched the capabilities and applicable scenarios of Sort, solved the demand for quick troubleshooting in development and operation, and optimized the user experience of Apache InLong operation and maintenance. In Apache InLong 1.13.0 version, a large number of other features have also been completed, mainly including:

Agent Module

  • Support data version numbers to distinguish between normal data and supplementary data
  • Location storage supports plugins, currently supporting Rocksdb and Zookeeper
  • Support configuration version number comparison to prevent repeated configuration
  • Support minute level file collection
  • Add PostgreSQL and MongoDB data source collection

Manager Module

  • Support installing agents through SSH
  • Switch audit ID query from direct interaction with database to Audit SDK
  • Offline synchronization supports Pulsar -> MySQL
  • Support offline synchronous scheduling information management
  • File collection supports multi IP collection
  • Support obtaining Agent configuration information
  • Support automatic synchronization to Sink after modifying Stream field information
  • Support field template management
  • Data preview supports KV format
  • Data preview supports querying based on field filtering criteria

Dashboard Module

  • Add Source Data Field Template Page
  • Add monitoring and auditing page
  • Support installing agents through SSH key based authentication
  • Audit supports displaying total and variance audit data
  • File type data stream supports minute level cycles

Audit Module

  • Unified allocation and management of audit items using the Audit SDK
  • The Audit SDK supports automatic management of Audit Proxy addresses
  • Audit Store supports the universal JDBC protocol
  • Restarting the Audit Store optimization process may lead to data loss issues
  • Audit Service optimizes thread pool management
  • Audit Service compatible with historical audit data with empty Audit Tag
  • Audit Service Optimization for OpenAPI Audit Transmission Delay Calculation

Sort Module

  • Add JDBC connector on Flink 1.15
  • Added Pulsar connector on Flink 1.18
  • Redis connector supports reporting audit information
  • Kafka connector supports reporting audit information
  • MongoDB connector supports reporting audit information
  • PostgreSQL connector supports reporting audit information

1.13.0 Version Feature Introduction

Manager supports installing Agent by SSH

Through this feature, operation and maintenance personnel can install agents through the Dashboard, which currently supports SSH and manual installation methods. Users can create a new Agent cluster on the cluster management page.

1.13.0-agent-cluster.png

Afterwards, enter the node, select the new node, and configure the SSH username and password to achieve SSH installation agent capability. Thanks to @haifxu and @fuweng11. For more information, please refer to INLONG-10409.

1.13.0-agent-install.png

Manager supports field template management

With this feature, users can pre-configure field templates, and when creating a new Stream, they can select the pre-configured field templates, thus achieving the purpose of reusing configurations across multiple Streams. Thanks to @kamianlaida and @fuweng11. For more information, please refer to: INLONG-10330.

1.13.0-create-template.png

Support the construction of the underlying framework for configuring offline synchronization tasks

In version 1.13.0, Manager supports the configuration of offline synchronization tasks. Compared to real-time synchronization, offline data synchronization pays more attention to synchronization throughput and efficiency. Real-time synchronization tasks run in the manner of Flink stream tasks, while offline synchronization runs in the manner of Flink batch tasks. This approach can ensure the consistency of real-time and offline synchronization tasks' code as much as possible, reducing maintenance costs. The offline synchronization function of InLong will be combined with the scheduling system to synchronize the complete or incremental data of the data source information to the data target. The offline synchronization task is created by InLong Manager (including scheduling information), and the specific data synchronization logic is implemented through the InLong Sort module.

1.13.0-offline-architecture.png

Key Competency:

  • Job Type: Support single or periodic offline data synchronization.
  • Scheduling:The scheduling function is plugin based, with a built-in simple periodic scheduling function that allows customization of third-party scheduling systems to support more complex capabilities, such as task dependencies.
  • Compute Engine: Flink
  • Offline Job Operation and Maintenance: Job start,stop and running status monitoring
  • Special Handling: Dirty Data Processing Capability

The following is the core process:

1.13.0-dataflow-architecture.png

Thanks to @aloyszhang. For more information, please refer to: INLONG-10054, INLONG-10053, INLONG-10055, INLONG-10069.

Optimize the Sort Standalone configuration process

In version 1.13.0, the distribution process of Sort Standalone configuration was modified. In previous versions, there were the following issues with the distribution of Sort Standalone configuration:

  • Configuration changes are unreliable. Configuration changes will be updated in real-time to the manager's cache. Once the configuration is changed, Sort Standalone will sense it and write data based on the new configuration without verifying the configuration.
  • The configuration and construction process is repetitive and cumbersome. When pulling the Sort configuration from the database, it is a full pull and real-time build is performed after the pull is complete.

In the new version, after modifying the data target, the Sort configuration will not take effect in real time, but will need to be built and written into the sort_config table after executing the workflow. The following figure shows a process comparison:

1.13.0-manager-standalone.png

Thanks to @fuweng11, @vernedeng. For more information, please refer to: INLONG-9867, INLONG-10017.

Agent ability for collecting data from PostgreSQL

In version 1.13.0, Agent supports data collection from PostgreSQL. When creating a data source, you can directly select PostgreSQL and fill all relevant data source information to start using it. The parameters include:

  • Data source name: Used to distinguish between different data sources
  • Cluster name: The cluster of data source belongs
  • Data source IP: The IP information of data source
  • Server Host: PostgreSQL Server Host
  • Port: PostgreSQL port

Thanks to @haifxu. For more information, please refer to: INLONG-10318.

1.13.0-agent-postgreSQL.png

Sort connectors reporting audit information supports exactly once

Sort Connectors support for exactly once semantics in reporting audit information. When an operator encounters an exception and the snapshot fails, each Connector ensures that the audit information is reported exactly once. As shown in the following diagram, when data is transferred, the current data will be written to the current operator's AuditBuffer and the corresponding checkpointId. When the checkpoint is completed (within the notifyCompleteCheckpoint method), the data written to the buffer will be reported.

1.13.0-sort-exactly.png

During the execution of snapShot, the checkpointId written to the buffer will be changed to the checkpointId in the snapShot, and when the checkpoint is completed, the audit information corresponding to the checkpointId will be uploaded to the Audit Server within the notifyCompleteCheckpoint method. In the Audit SDK, it is already supportedto report all information in the Audit buffer that is less than the specified checkpointId.

1.13.0-sort-audit.png

Thanks to @XiaoYou201. For more information, please refer to: INLONG-10311, INLONG-10312, INLONG-10317, INLONG-10355, INLONG-10357, INLONG-10358, INLONG-10401.

Future plans

In version 1.13.0, the community refactored the process of Manager issuing Agent, Sort tasks, enriched features such as Flink 1.15 Connector, Inlong Agent collecting PostgreSQL and other functions. In subsequent versions, InLong will continue to enrich Flink 1.15 Connector, enhance Transform capabilities, support offline data integration, unify DataProxy data protocols, optimize Dashboard experience, etc. We look forward to more developers participating and contributing.