AWS Glue manages dependencies between jobs using triggers. Triggers can start jobs based on the completion status of other jobs, making it possible to create ETL workflows where one job’s output is another job’s input.
Types of triggers:
Scheduled Triggers: Start jobs at specified times.
On-Demand Triggers: Start jobs manually.
Job Completion Triggers (called conditional triggers in the AWS Glue API): Start jobs based on the completion status of other jobs.
Job completion triggers
Job completion triggers are the key to managing dependencies between jobs. They start a job when the jobs they watch succeed, fail, stop, or time out. A single trigger can watch several jobs, so you can build complex workflows with multiple dependencies.
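As a rough sketch of a trigger with multiple dependencies, the helper below builds the keyword arguments for the Glue client's create_trigger call, watching several jobs at once. The helper name conditional_trigger and the job and trigger names are hypothetical, chosen only for illustration. No AWS call is made here.

```python
def conditional_trigger(name, watched_jobs, target_job,
                        state="SUCCEEDED", logical="AND"):
    """Build keyword arguments for glue.create_trigger (no AWS call is made)."""
    conditions = [
        {"LogicalOperator": "EQUALS", "JobName": job, "State": state}
        for job in watched_jobs
    ]
    predicate = {"Conditions": conditions}
    # "Logical" only matters when more than one condition is listed:
    # "AND" waits for every watched job, "ANY" fires on the first match.
    if len(conditions) > 1:
        predicate["Logical"] = logical
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": predicate,
        "Actions": [{"JobName": target_job}],
        "StartOnCreation": True,  # activate the trigger as soon as it exists
    }


# Hypothetical job names: load the warehouse only after both extracts succeed.
params = conditional_trigger(
    "TriggerLoadAfterExtracts",
    ["ExtractOrders", "ExtractCustomers"],
    "LoadWarehouse",
)
print(params["Predicate"]["Logical"], len(params["Predicate"]["Conditions"]))
```

With a Boto3 Glue client in hand, the dictionary could be passed as glue.create_trigger(**params).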
Creating job dependencies
Scenario:
You have three jobs: JobA, JobB, and JobC.
JobB should run after the successful completion of JobA.
JobC should run after the successful completion of JobB.
Steps:
Create jobs in AWS Glue
Navigate to the AWS Glue console.
Create the three jobs, JobA, JobB, and JobC.
Create triggers
Create a trigger TriggerAB to start JobB when JobA succeeds.
Create another trigger TriggerBC to start JobC when JobB succeeds.
Python/Boto3 example
You can also create the jobs and triggers programmatically with the AWS SDK for Python (Boto3), using the Glue client's create_job and create_trigger calls.
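The scenario above might be sketched as follows. This is a minimal illustration, not a complete setup: the function name create_chain, the IAM role ARN, the S3 bucket, and the script paths are all assumptions, and the job definitions omit settings (worker type, Glue version, and so on) that a real job would likely need.

```python
def create_chain(glue, role_arn, script_bucket):
    """Create JobA, JobB, JobC and the two triggers that chain them.

    `glue` is a Boto3 Glue client, e.g. boto3.client("glue").
    `role_arn` and `script_bucket` are placeholders for your own values.
    """
    for job_name in ("JobA", "JobB", "JobC"):
        glue.create_job(
            Name=job_name,
            Role=role_arn,
            Command={
                "Name": "glueetl",  # Spark ETL job
                # Assumed script layout; adjust to where your scripts live.
                "ScriptLocation": f"s3://{script_bucket}/scripts/{job_name}.py",
                "PythonVersion": "3",
            },
        )

    # (trigger name, job to watch, job to start)
    links = (("TriggerAB", "JobA", "JobB"), ("TriggerBC", "JobB", "JobC"))
    for trigger_name, upstream, downstream in links:
        glue.create_trigger(
            Name=trigger_name,
            Type="CONDITIONAL",
            Predicate={
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "JobName": upstream,
                        "State": "SUCCEEDED",
                    }
                ]
            },
            Actions=[{"JobName": downstream}],
            StartOnCreation=True,  # activate the trigger immediately
        )
```

To use it, import boto3 and call create_chain(boto3.client("glue"), role_arn, script_bucket) with your own role and bucket. Nothing runs until JobA is started, for example with glue.start_job_run(JobName="JobA"), after which TriggerAB and TriggerBC start the downstream jobs in turn.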
Workflow visualization
The AWS Glue console provides a visual interface for viewing and monitoring ETL workflows. It shows the flow of execution and the status of each job in the workflow, which helps you monitor runs and troubleshoot failures.
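The same status information can be read programmatically. The sketch below assumes the jobs and triggers were created inside a Glue workflow, and that a run was fetched with glue.get_workflow_run(Name=..., RunId=..., IncludeGraph=True)["Run"]. The helper job_states and the "NOT_STARTED" label are inventions for this example, and the sample dictionary is hand-written to mimic the shape of the response, not real output.

```python
def job_states(workflow_run):
    """Map each job node in a workflow run graph to its latest run state."""
    states = {}
    for node in workflow_run.get("Graph", {}).get("Nodes", []):
        if node.get("Type") != "JOB":
            continue  # skip trigger and crawler nodes
        runs = node.get("JobDetails", {}).get("JobRuns", [])
        if runs:
            # Retries show up as extra attempts; report the highest one.
            latest = max(runs, key=lambda run: run.get("Attempt", 0))
            states[node["Name"]] = latest["JobRunState"]
        else:
            # Placeholder label used by this sketch, not a Glue state.
            states[node["Name"]] = "NOT_STARTED"
    return states


# Hand-written sample shaped like the "Run" element of get_workflow_run.
sample_run = {
    "Graph": {
        "Nodes": [
            {"Type": "TRIGGER", "Name": "TriggerAB"},
            {"Type": "JOB", "Name": "JobA",
             "JobDetails": {"JobRuns": [{"Attempt": 0, "JobRunState": "SUCCEEDED"}]}},
            {"Type": "JOB", "Name": "JobB",
             "JobDetails": {"JobRuns": [{"Attempt": 0, "JobRunState": "FAILED"}]}},
            {"Type": "JOB", "Name": "JobC", "JobDetails": {"JobRuns": []}},
        ]
    }
}
print(job_states(sample_run))
```

Here JobC has no runs because the trigger waiting on JobB's success never fired, which is the kind of stalled dependency the console graph makes visible.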
Error handling and retry logic
AWS Glue also provides options for error handling and retry logic. You can set the maximum number of retries for a job (the MaxRetries job property) and decide what should happen if a job fails, for example by adding a trigger that fires on failure. A trigger that waits for success never fires when the prerequisite job fails, so dependent jobs do not start until the jobs they depend on have completed successfully.
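One way to express these options with Boto3 is sketched below: a job definition with MaxRetries and Timeout (in minutes), plus a conditional trigger that starts a handler job when the watched job fails or times out. The helper names, the retry and timeout values, and the job names (including the "CleanupJob" handler) are hypothetical. Both helpers only build keyword arguments for glue.create_job and glue.create_trigger.

```python
def job_with_retries(name, role_arn, script_location,
                     max_retries=2, timeout_minutes=60):
    """Build keyword arguments for glue.create_job with retry settings."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "MaxRetries": max_retries,   # automatic retries after a failed run
        "Timeout": timeout_minutes,  # minutes before the run is timed out
    }


def on_failure_trigger(name, watched_job, handler_job):
    """Build keyword arguments for a trigger that reacts to a failed job."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "ANY",  # fire on either condition
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": watched_job, "State": state}
                for state in ("FAILED", "TIMEOUT")
            ],
        },
        "Actions": [{"JobName": handler_job}],
        "StartOnCreation": True,
    }


job = job_with_retries(
    "JobA",
    "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    "s3://example-bucket/scripts/JobA.py",         # placeholder script
)
trigger = on_failure_trigger("TriggerAFailed", "JobA", "CleanupJob")
print(job["MaxRetries"], [c["State"] for c in trigger["Predicate"]["Conditions"]])
```

Passing these dictionaries to glue.create_job(**job) and glue.create_trigger(**trigger) would register them. The success triggers in the chain stay untouched, so JobB still waits for JobA to succeed.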
Monitoring with CloudWatch
AWS Glue jobs and triggers generate metrics, logs, and events that you can monitor with Amazon CloudWatch. You can set up CloudWatch alarms that notify you if a job fails or runs longer than expected, so you can respond quickly to issues in your ETL workflows.
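As one possible way to get failure notifications, the sketch below reacts to Glue's job state change events with an Amazon EventBridge rule (the service formerly known as CloudWatch Events) rather than a metric alarm. The rule name, target ID, and SNS topic ARN are placeholders, and the sketch assumes the topic's access policy already allows EventBridge to publish to it.

```python
import json


def glue_failure_pattern(job_names):
    """Event pattern matching failed or timed-out runs of the given jobs."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": list(job_names),
            "state": ["FAILED", "TIMEOUT"],
        },
    }


def create_failure_rule(events, rule_name, job_names, sns_topic_arn):
    """Create the rule and point it at an SNS topic.

    `events` is a Boto3 EventBridge client, e.g. boto3.client("events").
    """
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(glue_failure_pattern(job_names)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-on-failure", "Arn": sns_topic_arn}],
    )


pattern = glue_failure_pattern(["JobA", "JobB", "JobC"])
print(json.dumps(pattern, indent=2))
```

Detecting runs that take longer than expected is a separate concern. A job-level Timeout or a CloudWatch alarm on a duration metric would cover that case.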