Designing metadata enrichments

When you enrich asset metadata, you must decide which data assets to enrich, what type of metadata to add, and whether to schedule enrichment jobs.

Typically, metadata enrichment is part of a larger data curation plan. For example, after you import metadata for data assets, you can add business metadata to the imported data assets, identify relationships between the assets, and analyze the data quality of these assets. Finally, you can publish the completed data assets to a catalog to share with your organization. Before you design your metadata enrichment, make sure that you understand the implications of your choices for your overall curation plan. See Planning for curation.

Project setup

Select or create the project in which you want to work. Remember that projects that are marked as sensitive do not allow for publishing to catalogs or for downloading data. Thus, they are not suitable if you want to share the enriched assets or download results for review in a spreadsheet.

As the project administrator, define default enrichment settings that apply to all metadata enrichments in the selected project. You can overwrite some of these settings when you create or edit your metadata enrichment.

Scope of enrichment

Usually, the first step when you enrich metadata is to select the data that you want to enrich. You can enrich relational and structured data assets.

Metadata enrichment is run on assets that are available in the project. Thus, the list of enriched assets in the enrichment results might not correspond to the configured scope of included metadata import assets in these cases:

  • Metadata import was not yet complete when the enrichment started.
  • Metadata import failed for a set of assets or failed completely.

Initial data scope

The Data assets list shows all assets of the supported formats. You can select individual assets, or you can select metadata import assets to enrich the entire set of data assets from those metadata imports. However, you can't select data assets or metadata imports that are already included in a metadata enrichment. For individual data assets, you can hover over the asset name to see which metadata enrichment includes the asset.

A metadata import asset is automatically excluded from the selection scope in these cases:

  • It has a catalog as the import target.

  • It was run on a connection that doesn't support access to the actual data.

    See Importing metadata.

    Remember: Each data asset or metadata import can be included in only one metadata enrichment per project. If you want to enrich a data asset several times with different enrichment options, you need to do that in separate projects.

If any of the connections for the selected data assets is configured to use personal instead of shared credentials, you must unlock that connection before you can proceed.

You can also create an empty metadata enrichment asset and set the scope later.

Scope of reruns of the enrichment

For reruns of the enrichment, whether scheduled or run manually, the data scope can be all assets from the initially selected data scope or a subset of assets. The default option is New and modified assets and assets not enriched in the previous run. With this option, assets are selected for enrichment as follows:

  • Assets that were added after the last run of the enrichment
  • Assets where columns were added or removed after the last run of the enrichment
  • Assets where asset or column descriptions changed after the last run of the enrichment
  • Assets for which the previous enrichment failed or was canceled

Enrichment is always run on the entire data asset regardless of whether an asset is new or modified.

The job run log shows reruns of metadata enrichments that are configured with the limited data scope as delta metadata enrichment job runs.
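The selection rules for the default rerun scope can be sketched as a simple filter. This is only an illustration of the four criteria listed above: the `AssetState` class, its field names, and the status strings are hypothetical and do not reflect the product's internal data model or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class AssetState:
    """Hypothetical snapshot of one data asset (not a product data model)."""
    name: str
    added_at: datetime
    columns_changed_at: Optional[datetime] = None
    descriptions_changed_at: Optional[datetime] = None
    # Assumed status values: "completed", "failed", "canceled", or None
    last_enrichment_status: Optional[str] = None


def needs_rerun(asset: AssetState, last_run: datetime) -> bool:
    """Return True if the asset matches any of the four delta criteria."""
    if asset.added_at > last_run:
        return True  # added after the last run
    if asset.columns_changed_at is not None and asset.columns_changed_at > last_run:
        return True  # columns added or removed after the last run
    if (asset.descriptions_changed_at is not None
            and asset.descriptions_changed_at > last_run):
        return True  # asset or column descriptions changed after the last run
    # previous enrichment failed or was canceled
    return asset.last_enrichment_status in {"failed", "canceled"}


def select_delta_scope(assets: List[AssetState], last_run: datetime) -> List[str]:
    """Names of assets that a delta run would pick up, in input order."""
    return [a.name for a in assets if needs_rerun(a, last_run)]
```

Each selected asset is then enriched in full, not only its changed columns.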

Enrichment objectives

You can choose from these enrichment objectives:

Profile data

Generate basic statistics about the asset content, and assign and suggest data classes.

This type of profiling is fast but makes some approximations for certain metrics like frequency distribution and uniqueness. To get more exact results without approximation, run advanced profiling on selected data assets. See Advanced data profiling. For more information about the statistics, see Detailed profiling results.

Data classes describe the contents of the data in the column: for example, city, account number, or credit card number. Data classes can be used to mask data with data protection rules or to restrict access to data assets with policies. In addition, they can contribute to term assignments if a corresponding data class to term linkage exists.

The confidence of a data class is the percentage of nonnull values that match the data class. The confidence score for a data class to be assigned or suggested must at least equal the set threshold. See Data class assignment settings. If a threshold is set on a data class directly, this threshold takes precedence when data classes are assigned. It is not considered for suggestions. In addition to the confidence score, the priority of a data class is taken into account.
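As a rough illustration of the confidence calculation and the threshold check, consider the following sketch. The function names, the matcher callback, and the threshold parameters are assumptions made for this example. The sketch does not model data class priority, and it is not the product's implementation.

```python
from typing import Callable, Optional, Sequence


def data_class_confidence(values: Sequence[Optional[str]],
                          matches: Callable[[str], bool]) -> float:
    """Percentage of non-null values that match the data class."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0  # assumption: a column with only nulls has no confidence
    matched = sum(1 for v in non_null if matches(v))
    return 100.0 * matched / len(non_null)


def decide(confidence: float,
           assignment_threshold: float,
           suggestion_threshold: float,
           class_threshold: Optional[float] = None) -> str:
    """Return "assign", "suggest", or "none" for one data class.

    A threshold set directly on the data class replaces the general
    assignment threshold. It is ignored for suggestions.
    """
    effective = class_threshold if class_threshold is not None else assignment_threshold
    if confidence >= effective:
        return "assign"
    if confidence >= suggestion_threshold:
        return "suggest"
    return "none"
```

For example, a column in which 3 of 4 non-null values look like five-digit postal codes has a confidence of 75%. With an assignment threshold of 80 and a suggestion threshold of 50, the data class would be suggested but not assigned.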

Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers always have a confidence of 100% and include the following data classes: code, date, identifier, indicator, quantity, and text.

Single-column primary keys are suggested based on profiling statistics. If primary key and foreign key constraints are already defined in your data and this information is included in the metadata import, these keys are automatically assigned.

From the enrichment results, you can run a multi-column primary key analysis where the actual data is checked. For more information, see Identifying primary keys.

Expand metadata

Generate semantic names and descriptions for data assets and columns. The names that exist in the source are expanded based on the collected metadata and a predefined glossary by using fuzzy matching and by comparing the names to business term abbreviations in the categories selected for the enrichment. If the asset or column name in the source can be matched to a business term abbreviation, the corresponding business term is used as the display name. Generative AI is used to provide descriptions based on the expanded names, surrounding columns, and the context of the data assets. Use this option to provide alternative names that are easier to consume than the often very technical original names. AI-generated descriptions can help users understand the content, especially when column or data asset descriptions are missing in the data source. The assignment and suggestion thresholds are defined in the default enrichment settings.

Assign terms and classifications

Automatically assign business terms to columns and entire assets, or suggest business terms for manual assignment. Those assignments or suggestions are generated by a set of services. See Automatic term assignment.

Depending on which term assignment services are active for your project, term assignment might require profiling.

In addition, assign classifications to data assets and columns based on automatically assigned terms and data classes. Classification assignment must be enabled in the default enrichment settings. Classification assignment based on data classes also requires profiling.

Run basic quality analysis

Run predefined data quality checks on the columns of a data asset. The set of checks that is applied is defined in the enrichment settings. See Basic quality analysis settings. Each check can contribute to the asset's overall data quality scores. This type of data quality analysis can be done only in combination with profiling. Therefore, the Profile data option is automatically selected when you select to analyze data quality.

You can choose whether you want to write the output of these checks to a database. If default settings exist, the sections are populated accordingly. You can overwrite the settings. If no default settings exist, configure the output and the output location. For information about which data sources are supported as output target, see column Output tables in Supported data sources. Schema and table names must follow this convention:

  • The first character for the name must be an alphabetic character.
  • The rest of the name can consist of alphabetic characters, numeric characters, or underscores.
  • The name must not contain spaces.

If you select to write the exceptions or the rows in which the issues were found (exception records) to existing tables, make sure these tables have the required format. See Data quality output.

If the connection that you pick is locked, you are asked to enter your personal credentials. This is a one-time step that permanently unlocks the connection for you.

Set relationships

Use profiling statistics and name similarities between columns to identify primary and foreign keys and to suggest or assign relationships between assets and columns. The default enrichment settings for key relationships are applied. This type of relationship analysis requires profiling.
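The profiling dependencies described for the enrichment objectives can be summarized in a small check. The objective identifiers below are made up for this example and are not product setting names. Term assignment is left out of the set because whether it needs profiling depends on which term assignment services are active.

```python
from typing import Set

# Hypothetical identifiers for objectives that always require profiling.
REQUIRES_PROFILING = {"quality_analysis", "set_relationships"}


def missing_prerequisites(selected: Set[str]) -> Set[str]:
    """Return {"profile_data"} if a selected objective needs profiling
    and profiling is not selected; otherwise return an empty set."""
    if "profile_data" in selected:
        return set()
    if selected & REQUIRES_PROFILING:
        return {"profile_data"}
    return set()
```

A selection of only `quality_analysis` would report `profile_data` as missing, which mirrors the automatic selection of Profile data described for basic quality analysis.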

Category selection

Select categories to determine the data classes and business terms that can be applied during the enrichment. A project administrator might have limited the set of categories to choose from when you create an enrichment. This limitation does not apply when you edit the enrichment. In any case, you can choose only from categories where you are a collaborator with at least the Viewer role.

Select only categories with governance artifacts that are relevant for your use case.

This selection applies to automatic assignments and suggestions only. When you manually assign terms or data classes, you can choose from all categories to which you have access.

Changes to the set of categories to choose from or the actual category selection take effect with the next enrichment run. However, existing assignments remain unchanged.

If your access to any of the selected categories is revoked after you ran the metadata enrichment and you don’t make any changes to the enrichment, any rerun still considers all selected categories for data class and term assignments.

Sampling

You can choose from these sampling types:

Basic
Basic sampling works with the smallest possible sample size to speed up the process: 1,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
Moderate
Moderate sampling works with a medium sample size to provide reasonably accurate results without being too time-consuming: 10,000 rows per table are analyzed, and classification is done based on the most frequent 100 values per column.
Comprehensive
Comprehensive sampling works with a large sample size to provide more accurate results: 100,000 rows per table are analyzed, and classification takes all values per column into account. However, this method is time and resource intensive.
Custom
Define the sampling method, the sample size, and the basis for classification yourself:
  • Choose between sequential and random sampling. With sequential sampling, the first rows of a data set are selected in a sequential order. With random sampling, the rows to be included are randomly selected. For both methods, the maximum number of rows to be selected is determined by the defined sample size. Random sampling is available only for data assets from data sources that support this type of sampling.

  • Define the maximum size of the sample. You can set a fixed number of rows or specify what percentage of the rows in the data set you want to be analyzed. If you define the sample size as a percentage value, you can optionally set the minimum and maximum number of rows that the sample can include. You might want to set these values when you don't know the size of the data sets to be analyzed. The number or percentage of rows selected for the sample can only approximate the specified value.

    If the data source does not support fetching the actual record count of a data set, only a subset of the sampling options is available.

  • Select whether you want a data class to be assigned based on all values in a column or on the most frequent values in a column where you can specify the number of values you want to be taken into account.

Basic, moderate, or comprehensive sampling is sequential and starts at the top of the table. To suppress sampling, use custom sampling that is configured with random sampling and a sample size of 100%.
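To make the sampling options concrete, the following sketch lists the three predefined sample sizes and shows one way a percentage-based custom sample size could be bounded by the optional minimum and maximum. The names, the rounding rule, and the order in which the bounds are applied are assumptions for illustration. The actual number of sampled rows only approximates the configured value.

```python
import math
from typing import Optional

# Rows analyzed per table and number of most frequent values used for
# classification (None means that all values are considered).
PRESETS = {
    "basic": {"rows": 1_000, "classification_values": 100},
    "moderate": {"rows": 10_000, "classification_values": 100},
    "comprehensive": {"rows": 100_000, "classification_values": None},
}


def percent_sample_rows(total_rows: int,
                        percent: float,
                        min_rows: Optional[int] = None,
                        max_rows: Optional[int] = None) -> int:
    """Estimate the sample size for a percentage-based custom sample."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    rows = math.ceil(total_rows * percent / 100)  # assumed rounding rule
    if min_rows is not None:
        rows = max(rows, min_rows)
    if max_rows is not None:
        rows = min(rows, max_rows)
    # A sample cannot contain more rows than the data set has.
    return min(rows, total_rows)
```

With these assumptions, a 10% sample of a 2,000,000-row table with a maximum of 100,000 rows is capped at 100,000 rows, while a 10% sample of a 5,000-row table with a minimum of 1,000 rows is raised from 500 to 1,000 rows.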

Scheduling options

If your data scope includes metadata import assets, the Schedule page provides information about any configured schedules of the respective metadata import jobs. This information helps you coordinate your enrichment schedule with any import schedules.

The default name of the enrichment job is metadata_enrichment_name job. You can change the name to fit your naming schema.

You can access the enrichment job that you create from within the metadata enrichment asset or from the Jobs page in the project. This page also provides easy access to the job logs. See Jobs.

Run definition

Define when the metadata enrichment is run. You can select none, one, or both of these options:

Run after job creation

Select this option to run the metadata enrichment when you save a newly created metadata enrichment. Otherwise, the metadata enrichment asset is saved, but no job run is initiated.

Run on a schedule

Select this option to run the enrichment on a schedule. You can schedule single and recurring runs. Define the start date and time for the schedule. If you schedule a single run, the job runs exactly one time at the specified day and time.

To schedule recurring runs, select Repeat the job and the frequency at which you want the enrichment job to run. If you select Minutely, Hourly, or Daily, you can exclude certain days of the week from the schedule. Optionally, you can set an end date and time for the job schedule. For recurring runs, the job runs for the first time at the timestamp that is calculated based on the settings in the Repeat the job section.
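When you plan a recurring schedule, it can help to preview which run times result from a given frequency and set of excluded weekdays. The sketch below is a planning aid only: the function and its parameters are hypothetical, and it assumes that runs start at the configured start time and repeat at a fixed interval, which might differ from how the product calculates the first run.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Set


def preview_runs(start: datetime,
                 interval: timedelta,
                 excluded_weekdays: Set[int],
                 end: Optional[datetime] = None,
                 limit: int = 5) -> List[datetime]:
    """List up to `limit` run times, skipping excluded weekdays.

    Weekdays use the datetime.weekday() convention (Monday is 0,
    Sunday is 6). The end timestamp, if given, is inclusive.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    if len(excluded_weekdays & set(range(7))) == 7:
        raise ValueError("at least one weekday must remain in the schedule")
    runs: List[datetime] = []
    current = start
    while len(runs) < limit and (end is None or current <= end):
        if current.weekday() not in excluded_weekdays:
            runs.append(current)
        current += interval
    return runs
```

For example, a daily schedule that starts on Friday, 5 January 2024 at 09:00 and excludes Saturday and Sunday would yield runs on 5, 8, and 9 January.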

Regardless of the run definition, you can manually trigger a run of the metadata enrichment job at any time.
