Deploy DataHub using AWS managed services and ingest metadata from AWS Glue and Amazon Redshift – Part 2



In the first post of this series, we discussed the need for a metadata management solution for organizations. We used DataHub as an open-source metadata platform for metadata management and deployed it using AWS managed services with the AWS Cloud Development Kit (AWS CDK).

In this post, we focus on how to populate technical metadata from the AWS Glue Data Catalog and Amazon Redshift into DataHub, and how to augment data with a business glossary and visualize data lineage of AWS Glue jobs.

Overview of solution

The following diagram illustrates the solution architecture and its key components:

  1. DataHub runs on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, using Amazon OpenSearch Service, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon RDS for MySQL as the storage layer for the underlying data model and indexes.
  2. The solution pulls technical metadata from AWS Glue and Amazon Redshift to DataHub.
  3. We enrich the technical metadata with a business glossary.
  4. Finally, we run an AWS Glue job to transform the data and observe the data lineage in DataHub.

In the following sections, we demonstrate how to ingest the metadata using various methods, enrich the dataset, and capture the data lineage.

Pull technical metadata from AWS Glue and Amazon Redshift

In this step, we look at three different approaches to ingest metadata into DataHub for search and discovery.

DataHub supports both push-based and pull-based metadata ingestion. Push-based integrations (for example, Spark) allow you to emit metadata directly from your data systems when metadata changes, while pull-based integrations allow you to extract metadata from the data systems in a batch or incremental-batch manner. In this section, you pull technical metadata from the AWS Glue Data Catalog and Amazon Redshift using the DataHub web interface, Python, and the DataHub CLI.

Ingest data using the DataHub web interface

In this section, you use the DataHub web interface to ingest technical metadata. This method supports both the AWS Glue Data Catalog and Amazon Redshift, but we focus on Amazon Redshift here as a demonstration.

As a prerequisite, you need an Amazon Redshift cluster with sample data, accessible from the EKS cluster hosting DataHub (default TCP port 5439).

Create an access token

Complete the following steps to create an access token:

  1. Go to the DataHub web interface and choose Settings.
  2. Choose Generate new token.
  3. Enter a name (GMS_TOKEN), optional description, and expiry date and time.
  4. Copy the value of the token to a safe place.

Create an ingestion source

Next, we configure Amazon Redshift as our ingestion source.

  1. On the DataHub web interface, choose Ingestion.
  2. Choose Generate new source.
  3. Choose Amazon Redshift.
  4. In the Configure Recipe step, enter the values of host_port and database of your Amazon Redshift cluster and keep the rest unchanged:
# Coordinates
host_port: <cluster_endpoint>.<region>.redshift.amazonaws.com:5439
database: dev

The values for ${REDSHIFT_USERNAME}, ${REDSHIFT_PASSWORD}, and ${GMS_TOKEN} reference secrets that you set up in the next step.
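As a rough illustration of how the ${...} references relate to the secrets, the following Python sketch replaces placeholder names with values from a dictionary. It uses only the standard library and is not DataHub's own implementation. The resolve_secrets helper, the trimmed recipe structure (including the username and password keys), and all values are assumptions made for this example.

```python
from string import Template

# Simplified, assumed subset of a Redshift recipe that references secrets by name.
recipe = {
    "source": {
        "type": "redshift",
        "config": {
            "host_port": "<cluster_endpoint>.<region>.redshift.amazonaws.com:5439",
            "database": "dev",
            "username": "${REDSHIFT_USERNAME}",
            "password": "${REDSHIFT_PASSWORD}",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"token": "${GMS_TOKEN}"},
    },
}


def resolve_secrets(node, secrets):
    """Return a copy of node with ${NAME} placeholders replaced from secrets."""
    if isinstance(node, dict):
        return {key: resolve_secrets(value, secrets) for key, value in node.items()}
    if isinstance(node, str):
        # substitute() raises KeyError if a referenced secret is missing.
        return Template(node).substitute(secrets)
    return node


# Placeholder values for illustration only.
secrets = {
    "REDSHIFT_USERNAME": "awsuser",
    "REDSHIFT_PASSWORD": "example-password",
    "GMS_TOKEN": "example-token",
}
resolved = resolve_secrets(recipe, secrets)
print(resolved["source"]["config"]["username"])  # prints awsuser
```

A missing secret surfaces as a KeyError in this sketch, which mirrors why each name used in the recipe needs a matching secret before the ingestion runs.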

  5. Choose Next.
  6. For the run schedule, enter your desired cron syntax or choose Skip.
  7. Enter a name for the data source (for example, Amazon Redshift demo) and choose Done.

Create secrets for the data source recipe

To create your secrets, complete the following steps:

  1. On the DataHub Manage Ingestion page, choose Secrets.
  2. Choose Create new secret.
  3. For Name, enter REDSHIFT_USERNAME.
  4. For Value, enter awsuser (default admin user).
  5. For Description, enter an optional description.
  6. Repeat these steps for REDSHIFT_PASSWORD and GMS_TOKEN.

Run metadata ingestion

To ingest the metadata, complete the following steps:

  1. On the DataHub Manage Ingestion page, choose Sources.
  2. Choose Execute next to the Amazon Redshift source you just created.
  3. Choose Execute again to confirm.
  4. Expand the source and wait for the ingestion to complete, or check the error details (if any).

Tables in the Amazon Redshift cluster are now populated in DataHub. You can view these by navigating to Datasets > prod > redshift > dev > public > users.
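DataHub identifies each dataset by a URN that combines the platform, the dataset name, and the environment. The sketch below shows how the browse path above could map to such a URN. The dataset_urn function is a hypothetical helper written for this example (the DataHub Python client ships its own builder), and it assumes that prod in the browse path corresponds to the PROD environment.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN string from its three parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Datasets > prod > redshift > dev > public > users
urn = dataset_urn("redshift", "dev.public.users")
print(urn)  # urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.users,PROD)
```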

You'll further work on enriching this table metadata using the DataHub CLI in a later step.

Ingest data using Python code

In this section, you use Python code to ingest technical metadata into DataHub, using the AWS Glue Data Catalog as an example data source.

As a prerequisite, you need a sample database and table in the Data Catalog. You also need an AWS Identity and Access Management (IAM) user with the required IAM permissions:

    "Effect": "Allow",
    "Action": [
    "Resource": [

Note the GMS_ENDPOINT value for DataHub by running kubectl get svc, and locate the load balancer URL and port number (8080) for the service datahub-datahub-gms.
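If you prefer to derive the endpoint programmatically, the following sketch builds the URL from the JSON form of the service (for example, the output of kubectl get svc datahub-datahub-gms -o json). The gms_endpoint helper and the sample host name are made up for illustration; the field names follow the standard Kubernetes Service object.

```python
import json


def gms_endpoint(service: dict, port: int = 8080) -> str:
    """Build the GMS URL from a Kubernetes Service object (parsed JSON)."""
    ingress = service["status"]["loadBalancer"]["ingress"][0]
    # AWS load balancers usually expose a hostname; other providers may expose an IP.
    host = ingress.get("hostname") or ingress.get("ip")
    ports = [p["port"] for p in service["spec"]["ports"]]
    if port not in ports:
        raise ValueError(f"port {port} is not exposed by the service: {ports}")
    return f"http://{host}:{port}"


# Trimmed, made-up example of the JSON returned for the service.
sample = json.loads("""
{
  "spec": {"ports": [{"port": 8080}]},
  "status": {"loadBalancer": {"ingress": [{"hostname": "example.elb.amazonaws.com"}]}}
}
""")
print(gms_endpoint(sample))  # http://example.elb.amazonaws.com:8080
```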

Install the DataHub client

To install the DataHub client with AWS Cloud9, complete the following steps:

  1. Open the AWS Cloud9 IDE and start the terminal.
  2. Create a new virtual environment and install the DataHub client:
# Create the virtual environment
python3 -m venv datahub
# Activate the virtual environment
source datahub/bin/activate
# Install/upgrade the DataHub client
pip3 install --upgrade acryl-datahub

  3. Check the installation:
datahub version
If DataHub is successfully installed, you see the following output:

DataHub CLI version:
Python version: 3.X.XX (default,XXXXX)

  4. Install the DataHub plugin for AWS Glue:
pip3 install --upgrade 'acryl-datahub[glue]'

Prepare and run the ingestion Python script

Complete the following steps to ingest the data:

  1. Download the ingestion script from the GitHub repository.
  2. Edit the values of both the source and sink objects:
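from datahub.ingestion.run.pipeline import Pipeline

# Build the ingestion pipeline: AWS Glue as the source, DataHub REST as the sink.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                "aws_access_key_id": "<aws_access_key>",
                "aws_secret_access_key": "<aws_secret_key>",
                "aws_region": "<aws_region>",
                "emit_s3_lineage": False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                # GMS_ENDPOINT: load balancer URL and port 8080
                "server": "http://<>",
                "token": "<your_gms_token_string>",
            },
        },
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()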

For production purposes, use an IAM role and store other parameters and credentials in AWS Systems Manager Parameter Store or AWS Secrets Manager.

To view all configuration options, refer to Config Details.
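One way to keep credentials out of the script is sketched below, under assumptions: the static AWS keys are omitted so that the AWS SDK's default credential chain (for example, an IAM role) can supply them, and the GMS endpoint and token are read from environment variables populated from your secret store. The build_recipe helper and the variable names AWS_REGION, GMS_ENDPOINT, and GMS_TOKEN are illustrative, and you should confirm in Config Details that the Glue source accepts a configuration without keys.

```python
import os


def build_recipe(env=os.environ) -> dict:
    """Assemble a Glue-to-DataHub recipe without hard-coded credentials."""
    return {
        "source": {
            "type": "glue",
            "config": {
                # No static keys here: assumes an IAM role supplies credentials.
                "aws_region": env["AWS_REGION"],
                "emit_s3_lineage": False,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": env["GMS_ENDPOINT"],
                "token": env["GMS_TOKEN"],
            },
        },
    }


# Example with explicit placeholder values instead of the real environment.
recipe = build_recipe({
    "AWS_REGION": "us-east-1",
    "GMS_ENDPOINT": "http://example.elb.amazonaws.com:8080",
    "GMS_TOKEN": "example-token",
})
print(sorted(recipe["source"]["config"]))
```

The resulting dictionary could be passed to Pipeline.create in place of an inline configuration.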

  3. Run the script within the DataHub virtual environment:

If you navigate back to the DataHub web interface, the databases and tables in your AWS Glue Data Catalog should appear under Datasets > prod > glue.

Ingest data using the DataHub CLI

In this section, you use the DataHub CLI to ingest a sample business glossary about data classification, personal information, and more.

As a prerequisite, you must have the DataHub CLI installed in the AWS Cloud9 IDE. If not, go through the steps in the previous section.

Prepare and ingest the business glossary

Complete the following steps:

  1. Open the AWS Cloud9 IDE.
  2. Download business_glossary.yml from the GitHub repository.
  3. Optionally, you can explore the file and add custom definitions (refer to Business Glossary for more information).
  4. Download business_glossary_to_datahub.yml from the GitHub repository.
  5. Edit the full path to the business glossary definition file, GMS endpoint, and GMS token:
source:
  type: datahub-business-glossary
  config:
    file: /home/ec2-user/environment/business_glossary.yml

sink:
  type: datahub-rest
  config:
    server: 'http://<>'
    token: '<your_gms_token_string>'

  6. Run the following code:
datahub ingest -c business_glossary_to_datahub.yml

  7. Navigate back to the DataHub interface, and choose Govern, then Glossary.

You should now see the new business glossary to use in the next section.

Enrich the dataset with more metadata

In this section, we enrich a dataset with additional context, including description, tags, and a business glossary, to help data discovery.

As a prerequisite, follow the earlier steps to ingest the metadata of the sample database from Amazon Redshift, and ingest the business glossary from a YAML file.

  1. In the DataHub web interface, browse to Datasets > prod > redshift > dev > public > users.
  2. Starting at the table level, we add related documentation and a link to the About section.

This allows analysts to understand the table relationships at a glance, as shown in the following screenshot.

  3. To further enhance the context, add the following:
    • Column description.
    • Tags for the table and columns to aid search and discovery.
    • Business glossary terms to organize data assets using a shared vocabulary. For example, we define userid in the USERS table as an account in business terms.
    • Owners.
    • A domain to group data assets into logical collections. This is useful when designing a data mesh on AWS.

Now we can search using the additional context. For example, searching for the term email with the tag tickit correctly returns the USERS table.

We can also search using tags, such as tags:"PII" OR fieldTags:"PII" OR editedFieldTags:"PII".
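The tag query above repeats the same tag across three search fields. As a small convenience, the following sketch assembles that query string for any tag; tag_query is a hypothetical helper for this example, not part of DataHub.

```python
def tag_query(tag: str) -> str:
    """Build a query matching a tag on a dataset or on any of its fields."""
    fields = ("tags", "fieldTags", "editedFieldTags")
    return " OR ".join(f'{field}:"{tag}"' for field in fields)


print(tag_query("PII"))  # tags:"PII" OR fieldTags:"PII" OR editedFieldTags:"PII"
```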

In the following example, we search using the field description fieldDescriptions:The user's home state, such as GA.

Feel free to explore the search features in DataHub to enhance the data discovery experience.

Capture data lineage

In this section, we create an AWS Glue job to capture the data lineage. This requires use of a datahub-spark-lineage JAR file as an additional dependency.

  1. Download the NYC yellow taxi trip data for January 2022 (in Parquet file format) and save it under s3://<<Your S3 Bucket>>/tripdata/.
  2. Create an AWS Glue crawler pointing to s3://<<Your S3 Bucket>>/tripdata/ and create a landing table called landing_nyx_taxi inside the database nyx_taxi.
  3. Download the datahub-spark-lineage JAR file (v0.8.41-3-rc3) and store it in s3://<<Your S3 Bucket>>/externalJar/.
  4. Download the file and store it in s3://<<Your S3 Bucket>>/externalJar/.
  5. Create a target table using the following SQL script.

The AWS Glue job reads the data in Parquet file format using the landing table, performs some basic data transformation, and writes it to the target table in Parquet format.

  6. Create an AWS Glue job using the following script and modify your GMS_ENDPOINT, GMS_TOKEN, and source and target database table name.
  7. On the Job details tab, provide the IAM role and disable job bookmarks.

  8. Add the path of datahub-spark-lineage (s3://<<Your S3 Bucket>>/externalJar/datahub-spark-lineage-0.8.41-3-rc3.jar) for Dependent JAR path.
  9. Enter the path of the file for Referenced files path.

The job reads the data from the landing table as a Spark DataFrame and then inserts the data into the target table. The JAR is a lightweight Java agent that listens for Spark application job events and pushes metadata out to DataHub in real time. The lineage of datasets that are read and written is captured. Events such as application start and end, and SQLExecution start and end are captured. This information can be seen under pipelines (DataJob) and tasks (DataFlow) in DataHub.
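To make the role of the JAR more concrete, the sketch below collects the Spark settings that would register the DataHub listener and point it at your GMS endpoint. It is not the job script from this post. The datahub_lineage_conf helper is hypothetical, and the property names are taken from the DataHub Spark lineage documentation as we understand it, so verify them against the JAR version you use.

```python
def datahub_lineage_conf(gms_endpoint: str, gms_token: str) -> dict:
    """Spark settings that register the DataHub listener (assumed key names)."""
    return {
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        "spark.datahub.rest.server": gms_endpoint,
        "spark.datahub.rest.token": gms_token,
    }


conf = datahub_lineage_conf("http://<>", "<your_gms_token_string>")
for key, value in conf.items():
    # In a job script, each pair would be passed to SparkSession.builder.config(key, value).
    print(f"{key}={value}")
```

Keeping the settings in one dictionary makes it straightforward to apply them when building the Spark session and to swap the endpoint and token per environment.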

  10. Run the AWS Glue job.

When the job is complete, you can see the lineage information is being populated in the DataHub UI.

The preceding lineage shows the data is being read from a table backed by an Amazon Simple Storage Service (Amazon S3) location and written to an AWS Glue Data Catalog table. The Spark run details like query run ID are captured, which can be mapped back to the Spark UI using the Spark application name and Spark application ID.

Clean up

To avoid incurring future charges, complete the following steps to delete the resources:

  1. Run helm uninstall datahub and helm uninstall prerequisites.
  2. Run cdk destroy --all.
  3. Delete the AWS Cloud9 environment.


Conclusion

In this post, we demonstrated how to search and discover data assets stored in your data lake (via the AWS Glue Data Catalog) and data warehouse in Amazon Redshift. You can augment data assets with a business glossary, and visualize the data lineage of AWS Glue jobs.

About the Authors

Debadatta Mohapatra is an AWS Data Lab Architect. He has extensive experience across big data, data science, and IoT, across consulting and industrials. He's an advocate of cloud-native data platforms and the value they can drive for customers across industries.

Corvus Lee is a Solutions Architect for AWS Data Lab. He enjoys all kinds of data-related discussions, and helps customers build MVPs using AWS databases, analytics, and machine learning services.

Suraj Bang is a Sr Solutions Architect at AWS. In this role, Suraj helps AWS customers with their analytics, database, and machine learning use cases, architects solutions to solve their business problems, and helps them build scalable prototypes.