Data governance tutorial: Curate high quality data
Take this tutorial to learn how to prepare trusted data with the Data governance use case of the data fabric trial. Your goal is to create trusted data assets by enriching your data and running data quality analysis.
The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Steward on the governance team, you must sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.
The following animated image previews what you’ll accomplish in this tutorial: you will import metadata from an external data source, enrich that data with auto-assigned business terms, view the enriched data, and publish it to a catalog. Click the image to view a larger image.
Preview the tutorial
In this tutorial, you will complete these tasks:
- Set up the prerequisites.
- Task 1: Create a catalog.
- Task 2: Create a category.
- Task 3: Add business terms.
- Task 4: Import data to a project.
- Task 5: Enrich the imported data.
- Task 6: View the results of the metadata enrichment.
- Task 7: Publish data to a catalog.
Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.
This video provides a visual method to learn the concepts and tasks in this documentation.
Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.
Use the video picture-in-picture
The following animated image shows how to use the video picture-in-picture and table of contents features:
Get help in the community
If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.
Set up your browser windows
For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.
Set up the prerequisites
Sign up for Cloud Pak for Data as a Service
You must sign up for Cloud Pak for Data as a Service and provision the necessary services for the Data governance use case.
- If you have an existing Cloud Pak for Data as a Service account, then you can get started with this tutorial. If you have a Lite plan account, only one user per account can run this tutorial.
- If you don't have a Cloud Pak for Data as a Service account yet, then sign up for a data fabric trial.
Watch the following video to learn about data fabric in Cloud Pak for Data.
Verify the necessary provisioned services
To preview this task, watch the video beginning at 01:05.
Follow these steps to verify or provision the necessary services:
-
From the Navigation Menu, choose Services > Service instances.
-
Use the Product drop-down list to determine whether an IBM Knowledge Catalog service instance exists.
-
If you need to create an IBM Knowledge Catalog service instance, click Add service.
-
Select IBM Knowledge Catalog.
-
Select the Lite plan.
-
Click Create.
-
Repeat these steps to verify or provision the Cloud Object Storage service.
Check your progress
The following image shows the provisioned service instances:
Create the sample project
To preview this task, watch the video beginning at 01:38.
If you did not already create the sample project for this tutorial, follow these steps:
-
Access the Data governance sample project in the Resource hub.
-
Click Create project.
-
If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.
-
Click Create.
-
Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.
-
Click the Assets tab to view the project's assets.
-
From the Overflow menu at the end of the Banking.csv data asset row, choose Download, and save it to your computer. You'll use that file in a later step.
Check your progress
The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.
Task 1: Create a catalog
To preview this task, watch the video beginning at 02:49.
Before you start working with data, create a catalog where you will publish data to share it with your organization. With the IBM Knowledge Catalog Lite plan, you can create only two catalogs. If you already have a catalog, you can skip this step. Otherwise, follow these steps to create a catalog:
-
From the Navigation Menu, choose Catalogs > View all catalogs.
-
If you see a catalog on the Catalogs page, then skip to Task 2: Create a category. Otherwise, follow these steps to create a new catalog:
-
Click Create Catalog.
-
For the Name, copy and paste the catalog name exactly as shown with no leading or trailing spaces:
Mortgage Approval Catalog
-
Select Enforce data protection rules, confirm the selection, and accept the defaults for the other fields.
-
Click Create.
Check your progress
The following image shows your catalog. You are now ready to share assets with your organization.
Task 2: Create a category
To preview this task, watch the video beginning at 03:13.
You need a category to contain the business terms that you’ll import in the next task. Categories act like folders to organize your governance artifacts and the people who can author and manage those artifacts. Follow these steps to create a category:
-
From the Cloud Pak for Data navigation menu, choose Governance > Categories.
-
Click Add category > New category.
-
For the name, type Banking.
-
Click Create.
Check your progress
The following image shows the Banking category. You are now ready to import business terms.
Task 3: Add business terms
To preview this task, watch the video beginning at 03:41.
Now import business terms into the new category. You’ll use them to enrich your data assets in a later step. Business terms are standardized definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. Follow these steps to import the business terms from a file:
-
From the Cloud Pak for Data navigation menu, choose Governance > Business terms.
-
Click Add business term > Import from file.
-
Click Drag and drop file here or upload.
-
Select the Banking.csv file that you downloaded earlier.
-
Click Open.
-
Click Next.
-
Select Replace all values, and click Next.
-
Click Go to task to see the draft business terms. If you miss the notification, then from the Cloud Pak for Data navigation menu, choose Governance > Task inbox.
-
Select the Publish business terms checkbox, and then click Publish. Click Publish to confirm.
-
From the Cloud Pak for Data navigation menu, choose Governance > Business terms to view the published business terms.
Check your progress
The following image shows the imported business terms. You are now ready to import the data to a project and then enrich it with the imported business terms.
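If the import in Task 3 rejects the file, a quick local check of the CSV can help narrow down the cause. The following is a minimal sketch that uses only Python's standard csv module. The sample content and its column names (Name, Description, Category) are hypothetical and are not the actual layout of the tutorial's file.

```python
import csv
import io


def summarize_csv(text):
    """Return (header, row_count, mismatched_lines) for CSV text.

    mismatched_lines holds the 1-based line numbers of data rows whose
    field count differs from the header, which usually indicates a
    malformed row.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    row_count = 0
    mismatched_lines = []
    for row in reader:
        row_count += 1
        if len(row) != len(header):
            mismatched_lines.append(reader.line_num)
    return header, row_count, mismatched_lines


# Hypothetical sample: the second data row is missing its last field.
sample = (
    "Name,Description,Category\n"
    "Address,Where a customer lives,Banking\n"
    "Social Security Number,National identifier\n"
)

header, rows, bad = summarize_csv(sample)
print(header, rows, bad)
```

To check a downloaded file instead of the inline sample, read it with `open(path, newline="", encoding="utf-8")` and pass the text to `summarize_csv`. The encoding is an assumption, so adjust it if the file uses a different one.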
Task 4: Import data to a project
To preview this task, watch the video beginning at 04:47.
The sample project includes a connection to a Db2 Warehouse instance, which contains the mortgage assets. You can import technical metadata that is associated with the data assets into a project or a catalog to inventory, evaluate, and catalog these assets. Technical metadata describes the structure of data objects. Follow these steps to import the data assets:
-
From the Navigation Menu, choose Projects > View all projects.
-
Click the Data governance project.
-
Click the Assets tab.
-
Click New asset > Import metadata for data assets.
-
For the name, copy and paste the following text:
Mortgage data - metadata import
-
Click Next to continue.
-
On the Select target page, select This project, and click Next to continue.
-
On the Select scope page, click Select connection.
-
Select the Data Fabric Trial - Db2 Warehouse connection.
-
Select the checkbox next to the WKC_MORTGAGE schema, then click the WKC_MORTGAGE schema name.
-
Select the following tables:
- COMMERCIAL_CLIENT
- CREDIT_SCORE
- HOUSE_PRICE
- MORTGAGE_APPLICANTS
- MORTGAGE_APPLICATION
-
Review the list of assets in the side panel, and then click Select.
-
Click Next to continue to the schedule. You can run the metadata import manually, so keep the schedule turned off.
-
Click Next to continue to the Advanced options.
-
Accept the default values on the Advanced options page, and click Next to continue to the review.
-
Review the summary of the import, and click Create. The metadata import job starts.
-
Click the Refresh icon to watch the status change from Queued to In progress to Imported. When the job run is complete, you see the five assets listed.
Check your progress
The following image shows the completed metadata import. Your next task is to enrich the imported data assets with the imported business terms.
Task 5: Enrich the imported data
To preview this task, watch the video beginning at 06:07.
You can enrich data assets with information that helps users find data faster and decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the imported data:
-
Click the Data governance project name in the navigation trail.
-
On the Assets tab, click New asset > Enrich data assets with metadata.
-
For the name, copy and paste the following text:
Mortgage data - metadata enrichment
-
Click Next to continue.
-
Click Select data from project.
-
Select Metadata import.
-
Click the checkbox next to Mortgage data - metadata import. This asset includes the following assets:
- COMMERCIAL_CLIENT
- CREDIT_SCORE
- HOUSE_PRICE
- MORTGAGE_APPLICANTS
- MORTGAGE_APPLICATION
-
Click Select.
-
Click Next to continue to the enrichment objective.
-
Select all enrichment objectives:
- Profile data
- Assign terms
- Run basic quality analysis
-
For Categories, click Select categories.
-
Select only [uncategorized] and Banking.
-
Click Select.
-
For the Sampling, select Basic.
-
Click Next to continue to the schedule. You can run the metadata enrichment manually, so keep the schedule turned off.
-
Click Next to continue to the review.
-
Click Create.
-
The metadata enrichment asset displays, but the job might take several minutes to complete. Click the Refresh icon to watch the status change from Not analyzed to In progress to Finished. When the job run is complete, you see the five assets listed.
Check your progress
The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.
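To build intuition for the difference between auto-assigned and suggested business terms, the following toy sketch scores each term against a column name by word overlap. The scoring method and both thresholds are assumptions made for illustration only; this is not how IBM Knowledge Catalog assigns terms. The SOCIAL_SECURITY_NUMBER column name is also hypothetical.

```python
def tokens(name):
    """Split a column name or business term into lowercase word tokens."""
    return set(name.replace("_", " ").lower().split())


def match_terms(column, terms, assign_at=0.8, suggest_at=0.4):
    """Return (assigned, suggested) terms for a column name.

    The score is the Jaccard overlap between the word tokens of the
    column name and of the term. Both thresholds are illustrative.
    """
    col = tokens(column)
    assigned, suggested = [], []
    for term in terms:
        term_tokens = tokens(term)
        union = col | term_tokens
        score = len(col & term_tokens) / len(union) if union else 0.0
        if score >= assign_at:
            assigned.append(term)
        elif score >= suggest_at:
            suggested.append(term)
    return assigned, suggested


terms = ["Address", "Social Security Number"]

email_result = match_terms("EMAIL_ADDRESS", terms)
ssn_result = match_terms("SOCIAL_SECURITY_NUMBER", terms)
print(email_result)
print(ssn_result)
```

In this sketch, a partial overlap produces only a suggestion that a person still has to confirm, while a full overlap is assigned automatically.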
Task 6: View the results of the metadata enrichment
To preview this task, watch the video beginning at 07:45.
After the metadata enrichment run completes, follow these steps to view the enriched data:
-
From the Mortgage data - metadata enrichment screen, click the Columns tab.
-
In the list of Columns, locate the EMAIL_ADDRESS column for the MORTGAGE_APPLICANTS asset.
-
At the end of the EMAIL_ADDRESS for MORTGAGE_APPLICANTS row, click the Overflow menu, and choose View column details.
-
In the side panel on the Details tab, you see profiling information such as Format, Frequency distribution, and Statistics.
-
In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes, which you can assign manually.
-
Review any suggested business terms or data classes and manually assign them. For example, you might see Address as a suggested business term.
-
Click Suggested business terms.
-
For Address, click Assign.
-
At the end of the EMAIL_ADDRESS column for the MORTGAGE_APPLICANTS asset row, click the Overflow menu, and choose View data quality details.
-
View the data quality information. IBM Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.
-
Click the X to close the Data quality window.
-
For the CITY column for the CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.
-
Click the Assets tab.
-
In the list of Assets, for the MORTGAGE_APPLICANTS asset, click the Overflow menu, and choose View asset details.
-
In the side panel, click the Governance tab to see the auto-assigned business terms.
-
Click the Edit icon to manually assign business terms.
-
Search for social. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms.
-
Select Social Security Number.
-
Click Assign.
Check your progress
The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.
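The profiling and data quality information from Task 6 can be approximated on a small scale to see what such numbers mean. The following is a simplified sketch for an email column: it computes a frequency distribution and a score from two checks. The check names (completeness, validity), the email pattern, and the averaging are assumptions for illustration; they are not the pre-built dimensions or the scoring formula that IBM Knowledge Catalog uses.

```python
import re
from collections import Counter

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values):
    """Profile a list of string values from one column."""
    non_empty = [v for v in values if v not in (None, "")]
    total = len(values)
    # Completeness: share of values that are present at all.
    completeness = len(non_empty) / total if total else 0.0
    # Validity: share of present values that look like an email address.
    valid = [v for v in non_empty if EMAIL_PATTERN.match(v)]
    validity = len(valid) / len(non_empty) if non_empty else 0.0
    return {
        "frequency": Counter(non_empty),
        "completeness": completeness,
        "validity": validity,
        "score": (completeness + validity) / 2,
    }


# Hypothetical sample values, not data from the tutorial's tables.
sample = [
    "ann@example.com",
    "bob@example.com",
    "",
    "not-an-email",
    "ann@example.com",
]

result = profile_column(sample)
print(result["frequency"].most_common(1))
print(round(result["score"], 3))
```

A low score in a sketch like this points to either missing values or badly formatted values, which is the kind of question the column-level quality details help answer.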
Task 7: Publish data to a catalog
To preview this task, watch the video beginning at 09:06.
Now that you have enriched data, you want to publish those data assets to a catalog so data scientists and data analysts can use them. Follow these steps to store the enriched data assets in a catalog so that others can access the trusted data:
-
Click the Data governance project name in the navigation trail.
-
Click the Assets tab.
-
Select Data > Data assets.
-
Select the COMMERCIAL_CLIENT, HOUSE_PRICE, MORTGAGE_APPLICANTS, and MORTGAGE_APPLICATION data assets from the list, and click Publish to catalog.
-
For the Target catalog, select Mortgage Approval Catalog, and click Next.
-
For the Tags, type the tag trusted, click + (plus sign), and then click Next.
-
Review the assets, and click Publish.
-
Clear all checked assets, then select the checkbox next to the CREDIT_SCORE asset from the list, and click Publish to catalog.
-
For the Target catalog, select Mortgage Approval Catalog, and click Next.
-
For the Tags, type the tag confidential, and click + (plus sign).
-
Type the tag trusted, and click + (plus sign) to add a second tag.
-
Select the option to Go to the catalog after publishing it, and click Next.
-
Review the assets, and click Publish.
-
Filter the assets in the Mortgage Approval Catalog.
-
Click the Filter icon.
-
Expand the Tag section.
-
Select trusted, and click Apply.
-
Verify that the five data assets were added to the catalog.
-
Change the name for the MORTGAGE_APPLICANTS data asset.
-
Open the MORTGAGE_APPLICANTS asset.
-
Click the Edit name icon.
-
Change the name to:
MORTGAGE_APPLICANTS_TRUST
-
Click Apply.
Check your progress
The following image shows the enriched data assets published to a catalog. Now you have trusted data available through your company's catalog.
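The tagging in Task 7 explains why filtering on trusted returns all five assets even though they were published in two batches. The following sketch models the published assets as a plain mapping from asset name to tags and applies the same filter. The data structure and the `filter_by_tag` helper are hypothetical illustrations, not a catalog API, and the asset names are shown as they are before the rename step.

```python
# Tags as applied in the two publish steps of Task 7.
published_assets = {
    "COMMERCIAL_CLIENT": {"trusted"},
    "HOUSE_PRICE": {"trusted"},
    "MORTGAGE_APPLICANTS": {"trusted"},
    "MORTGAGE_APPLICATION": {"trusted"},
    "CREDIT_SCORE": {"confidential", "trusted"},
}


def filter_by_tag(catalog, tag):
    """Return the sorted names of assets that carry the given tag."""
    return sorted(name for name, tags in catalog.items() if tag in tags)


trusted = filter_by_tag(published_assets, "trusted")
confidential = filter_by_tag(published_assets, "confidential")
print(len(trusted), trusted)
print(confidential)
```

Because CREDIT_SCORE carries both tags, it appears under either filter, while the other four assets appear only under trusted.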
As a Data Steward on the governance team, you learned how to sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.
Next steps
You are now ready to protect your data by creating data protection rules and masking flows to control access to your data. See the Protect your data tutorial.