Configuring metadata import for data integration assets
Data integration uses technical and business processes to combine data from different sources and create information that is meaningful and valuable for your enterprise. Data integration tools can use extract, transform, and load (ETL) patterns to transform and organize data based on business needs. Such ETL jobs move the data from source systems to target systems.
You can import lineage metadata from InfoSphere DataStage and DataStage for Cloud Pak for Data. You cannot import assets from these sources.
InfoSphere DataStage
When you create a lineage metadata import for InfoSphere DataStage, you can set options specific to this data source, and define the scope of data for which lineage is generated. For details about metadata import, see Designing metadata imports.
External inputs
You can provide InfoSphere DataStage assets such as jobs, stages, or operational metadata files as external inputs in a ZIP file. The folder must have the following structure:
<job_name.xml>
- The parallel DataStage jobs that you want to analyze and the parameter sets that you use in those jobs, as one XML file exported from the designer client.
omd_files
- A folder for operational metadata (OMD) files.
<omd_file_name>
- Operational metadata files that are collected during job runs to resolve any type of parameter in jobs.
sql_files
- A folder that contains all SQL files.
<sql_file_name>
- A file with SQL queries.
DSParams
- A file that contains the project- or environment-level parameters if applicable. You can get this file from the project directory.
datastageParameterOverride.txt
- A file with parameter-set overrides if your jobs use parameter sets.
connection_definition/odbcConnectionDefinition.ini
- A file with connection definitions for ODBC connections. Definitions of ODBC connections are not included in the DataStage XML exports and must be specified separately.
datastageComponentOverrides.csv
- A file with component-lineage overrides.
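For example, a ZIP archive with these external inputs might be laid out as follows; the job, OMD, and SQL file names are illustrative:

my_job.xml
omd_files/
    my_job_run_001.omd
sql_files/
    load_customers.sql
DSParams
datastageParameterOverride.txt
connection_definition/
    odbcConnectionDefinition.ini
datastageComponentOverrides.csv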
The datastageParameterOverride.txt file format
The datastageParameterOverride.txt file has the following content:
[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name = "default_param4_value"
param5_name = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name = "some_param4_value"
param5_name = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name = "other_param4_value"
param5_name = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"
The connection_definition/odbcConnectionDefinition.ini file format
The connection_definition/odbcConnectionDefinition.ini file has the following content. Create a separate [Shortcut_Name] section for each connection.
[<Shortcut_Name>]
Type=<connection_type>
Connection_String=<connection_string>
Server_Name=<server_name>
Database_Name=<database_name>
Schema_Name=<schema_name>
User_Name=<user_name>
- Shortcut_Name: The name of the connection or data server that is used by the data integration tool.
- connection_type: The type of data source.
- connection_string: A JDBC connection string or any identification of the database such as the system ID (SID) or the host name.
- server_name: The value depends on the type of data source:
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), or Teradata: The server name.
- FTP: The hostname.
- Oracle and other databases: The value is ignored.
- database_name: The value depends on the type of data source:
- Oracle: The global database name.
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), Teradata, and other databases: The name of the default database.
- user_name: The name of the user that logs in to the database.
Add a new line at the end of the parameters for each section.
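For example, a definition for a hypothetical Db2 connection might look like the following; the shortcut name, connection string, server, database, schema, and user values are illustrative:

[DB2_SALES]
Type=DB2
Connection_String=jdbc:db2://dbserver.example.com:50000/SALES
Server_Name=dbserver.example.com
Database_Name=SALES
Schema_Name=DWH
User_Name=etl_user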
The datastageComponentOverrides.csv file format
The datastageComponentOverrides.csv file has the following content:
"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""
The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
Advanced import options
- Oracle proxy user authentication
- You can use Oracle proxy user authentication. Set the value to true to change Oracle usernames in "USERNAME[SCHEMA_OWNER]" format to "SCHEMA_OWNER" format. In other cases, set the value to false.
- Value files
- Specify the names of value files to use in parameter sets in order of priority. For example, DEV1,TEST,PROD.
DataStage for Cloud Pak for Data
When you create a lineage metadata import for DataStage for Cloud Pak for Data, you can set options specific to this data source, and define the scope of data for which lineage is generated. For details about metadata import, see Designing metadata imports.
External inputs
You can provide DataStage flows as external inputs in a ZIP file. The folder must have the following structure:
<project_export.zip>
- A DataStage project exported to a ZIP file.
DSParams
- A file that contains the project- or environment-level parameters if applicable. You can get this file from the project directory.
datastageParameterOverride.txt
- A file with parameter-set overrides if your jobs use parameter sets.
connection_definition/odbcConnectionDefinition.ini
- A file with connection definitions for ODBC connections. Definitions of ODBC connections are not included in the DataStage XML exports and must be specified separately.
datastageComponentOverrides.csv
- A file with component-lineage overrides.
The format of the ZIP file with the exported DataStage project
When you export a DataStage project, it must have the following structure:
assets
- required folder.
.METADATA
- required folder.
data_intg_flow.*.json
- required files that contain information about flows.
connection.*.json
- optional files that contain information about connections.
parameter_set.*.json
- optional files that contain information about parameter sets.
job.*.json
- optional files that contain information about jobs.
job_run.*.json
- optional files that contain information about particular executions of the job.
data_intg_flow
- required folder.
- At least one file that contains the string "schemas":[{, but does not end in px_executables.
assettypes
- required folder.
project.json
- required file. There might be multiple instances of this file as a result of ZIP decompression, which is correct.
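As a sketch, a decompressed project export that satisfies these requirements might look like the following; the flow, connection, parameter-set, and job file names, as well as the exact nesting under the assets folder, are illustrative:

assets/
    .METADATA/
        data_intg_flow.flow1.json
        connection.db2_conn.json
        parameter_set.ps_env.json
        job.flow1_job.json
        job_run.flow1_job_run1.json
    data_intg_flow/
        flow1.json
assettypes/
    project.json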
The datastageParameterOverride.txt file format
The datastageParameterOverride.txt file has the following content:
[ENVIRONMENT]
PARAM1_NAME = "param1_value"
PARAM2_NAME = "param2_value"
PARAM3_NAME = "param3_value"
[PARAMETER_SET/parameter_set_name]
param4_name = "default_param4_value"
param5_name = "default_param5_value"
$PARAM3_NAME = "$PROJDEF"
[VALUE_FILE/parameter_set_name/value_file1_name]
param4_name = "some_param4_value"
param5_name = "some_param5_value"
$PARAM3_NAME = "some_param3_value"
[VALUE_FILE/parameter_set_name/value_file2_name]
param4_name = "other_param4_value"
param5_name = "other_param5_value"
$PARAM3_NAME = "other_param3_value"
[JOB/job1_name]
param6_name = "param6_value"
param7_name = "param7_value"
[JOB/job2_name]
param7_name = "param8_value"
The connection_definition/odbcConnectionDefinition.ini file format
The connection_definition/odbcConnectionDefinition.ini file has the following content. Create a separate [Shortcut_Name] section for each connection.
[<Shortcut_Name>]
Type=<connection_type>
Connection_String=<connection_string>
Server_Name=<server_name>
Database_Name=<database_name>
Schema_Name=<schema_name>
User_Name=<user_name>
- Shortcut_Name: The name of the connection or data server that is used by the data integration tool.
- connection_type: The type of data source.
- connection_string: A JDBC connection string or any identification of the database such as the system ID (SID) or the host name.
- server_name: The value depends on the type of data source:
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), or Teradata: The server name.
- FTP: The hostname.
- Oracle and other databases: The value is ignored.
- database_name: The value depends on the type of data source:
- Oracle: The global database name.
- Db2, Microsoft SQL Server, Netezza Performance Server, SAP ASE (formerly Sybase), Teradata, and other databases: The name of the default database.
- user_name: The name of the user that logs in to the database.
Add a new line at the end of the parameters for each section.
The datastageComponentOverrides.csv file format
The datastageComponentOverrides.csv file has the following content:
"Full path to Stage";"Input Link name";"Input Column name";"Output Link name";"Output Column name";"Edge Type (DIRECT | FILTER)";"Description (optional)"
"manual_mapping_job/Generic_3";"DSLink2";"a";"DSLink5";"b";"DIRECT";""
The path to the stage is in the format Job/[Shared and Local containers optional]/Stage.
Advanced import options
- Analyze job runs
- Specifies whether job runs are analyzed.
- Analyze job runs since
- Specifies the date after which runs are analyzed. If the value is empty, all runs are analyzed. Example value: 1970/01/01 00:00:00.000.
- Analyze jobs separately
- Specifies whether to analyze jobs separately, even when other runs are associated with them.
- Analyze flows without jobs
- Specifies whether flows without jobs are analyzed.
- Oracle proxy user authentication
- You can use Oracle proxy user authentication. Set the value to true to change Oracle usernames in "USERNAME[SCHEMA_OWNER]" format to "SCHEMA_OWNER" format. In other cases, set the value to false.
- Value files
- Specify the names of value files to use in parameter sets in order of priority. For example, DEV1,TEST,PROD.
Parent topic: Designing metadata imports