Amazon Athena interview questions

16. Can I use Amazon Athena to query data that I process using Amazon EMR?
Yes. Amazon Athena supports many of the same data formats as Amazon EMR, and Athena’s data catalog is Hive metastore compatible. If you use EMR and already have a Hive metastore, you can run your DDL statements on Amazon Athena and start querying your data right away without affecting your Amazon EMR jobs.
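
As a rough sketch of that workflow, the snippet below builds a Hive-style CREATE EXTERNAL TABLE statement and hands it to an Athena client. The table name, columns, and S3 paths are hypothetical. The `client` argument is assumed to be a boto3 Athena client (`boto3.client("athena")`), and any object with a matching `start_query_execution` method will work.

```python
def build_external_table_ddl(table, columns, location, stored_as="PARQUET"):
    """Return a Hive-compatible CREATE EXTERNAL TABLE statement.

    All names passed in are illustrative; in practice you would reuse
    the DDL you already run against your Hive metastore on EMR.
    """
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n"
        f"  {column_sql}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{location}'"
    )


def submit_ddl(client, ddl, database, output_location):
    """Submit the DDL through an Athena client and return the query execution ID."""
    response = client.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical table over data that an EMR job has written to S3.
ddl = build_external_table_ddl(
    table="clickstream",
    columns=[("user_id", "string"), ("event_time", "timestamp")],
    location="s3://example-bucket/clickstream/",
)
```

The statement only registers metadata. The files in S3 are left untouched, which is why EMR jobs reading the same location are not affected.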

17. Are there any charges to test preview features?
During the preview, you are not charged for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from Amazon S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as Amazon S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS Serverless Application Repository.

18. Why should I upgrade to AWS Glue Data Catalog?
AWS Glue is a fully managed ETL service. Glue has three main components: 1) a crawler that automatically scans your data sources, identifies data formats, and infers schemas, 2) a fully managed ETL service that allows you to transform and move data to various destinations, and 3) a Data Catalog that stores metadata about databases and tables stored either in S3 or in an ODBC- or JDBC-compliant data store. To take advantage of Glue, you must upgrade from Athena’s internal Data Catalog to the Glue Data Catalog.
The benefits of upgrading to the Glue Data Catalog are:
Unified Metadata Repository: AWS Glue is integrated across a wide range of AWS services. AWS Glue supports data stored in Amazon Aurora, Amazon RDS MySQL, Amazon RDS PostgreSQL, Amazon Redshift, and Amazon S3, as well as MySQL and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. AWS Glue provides out-of-the-box integration with Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and any Apache Hive Metastore-compatible application.
Automatic schema and partition recognition: AWS Glue automatically crawls your data sources, identifies data formats, and suggests schemas and transformations. Crawlers can help automate table creation and automatic loading of partitions.
Easy to build pipelines: AWS Glue’s ETL engine generates Python code that is customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. Once your ETL job is ready, you can schedule it to run on AWS Glue’s fully managed, scale-out Spark infrastructure. AWS Glue is serverless, so it handles provisioning, configuration, and scaling of the resources required to run your ETL jobs, allowing you to tightly integrate ETL in your workflow.
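
To make the partition recognition benefit more concrete, here is a minimal sketch of the idea behind Hive-style partition layouts, where folder names such as `year=2024/month=01` carry the partition columns. This is not how the Glue crawler is implemented. It only shows how partition values can be read from an S3 key and turned into the ALTER TABLE statement you would otherwise write by hand. The function names, bucket, and key are hypothetical.

```python
def parse_hive_partitions(s3_key):
    """Extract key=value partition segments from an S3 object key, in order."""
    partitions = []
    for segment in s3_key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions.append((name, value))
    return partitions


def add_partition_statement(table, bucket, s3_key):
    """Build an ALTER TABLE ... ADD PARTITION statement for the key's folder."""
    partitions = parse_hive_partitions(s3_key)
    if not partitions:
        raise ValueError(f"no key=value partition segments in {s3_key!r}")
    spec = ", ".join(f"{name} = '{value}'" for name, value in partitions)
    prefix = s3_key.rsplit("/", 1)[0]  # folder that holds the object
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION 's3://{bucket}/{prefix}/'"
    )


# Hypothetical object written by an upstream job.
key = "logs/year=2024/month=01/part-0000.parquet"
statement = add_partition_statement("logs", "example-bucket", key)
```

A crawler removes the need to generate and run statements like this yourself each time new folders appear.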

19. What is the underlying technology behind Amazon Athena?
Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena can handle complex analysis, including large joins, window functions, and arrays. Because Amazon Athena uses Amazon S3 as the underlying data store, it is highly available and durable with data redundantly stored across multiple facilities and multiple devices in each facility.
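
For illustration, the query below combines a window function with an array function, and the helper waits for a submitted query to finish. It is a sketch under assumptions: the `orders` table and its columns are hypothetical, and `client` is assumed to be a boto3 Athena client, whose `get_query_execution` response reports the query state under `QueryExecution.Status.State`.

```python
import time

# Hypothetical table: orders(customer_id, order_total, item_ids array<string>).
# rank() is a window function; cardinality() returns the length of an array.
TOP_ORDERS_SQL = """
SELECT customer_id,
       order_total,
       cardinality(item_ids) AS item_count,
       rank() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS order_rank
FROM orders
"""

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=1.0, max_polls=60):
    """Poll an Athena client until the query reaches a terminal state."""
    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"query {query_execution_id} still running after {max_polls} polls"
    )
```

Polling is needed because Athena runs queries asynchronously: submitting a query returns an execution ID, and the results are written to S3 once the query succeeds.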

20. Why should you use federated queries in Athena?
Developers often pick relational, key-value, document, in-memory, search, graph, time-series, and ledger databases in addition to storing data on S3. Running analytics on data spread across such a wide variety of sources can be complex and time-consuming. Analysts often have to learn new programming languages and database constructs, and build complex pipelines that extract, transform, and create copies of data before they can analyze it. Similarly, data scientists often need to extract data from multiple sources to create a data set fit for feature extraction and training. This process is slow, and it inhibits building self-service platforms where analysts and data scientists can easily build pipelines that extract data from multiple sources. Analysts typically have to depend on data engineering teams to build such pipelines, which introduces delays and complexity. Federated query eliminates this complexity by providing an easy-to-use, pay-per-query, serverless service that allows you to run SQL queries across a variety of such data stores. You can use well-known SQL constructs to query data across multiple data sources for quick analysis, or use scheduled SQL queries to extract and transform data from multiple data sources and store the results in S3 for further analysis.
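
The two usage patterns described above can be sketched as SQL text generated from Python. This is an assumption-laden illustration, not a definitive recipe: it assumes the federated source has already been registered as a catalog named `mysql_catalog`, and every table, column, and bucket name is hypothetical. The first helper joins an S3-backed table with a federated table. The second wraps that query in a CREATE TABLE AS statement so the result is stored in S3.

```python
def federated_join_sql(select_columns, s3_table, federated_catalog,
                       federated_table, join_key):
    """Join an S3-backed table (alias s) with a federated table (alias f)."""
    columns = ", ".join(select_columns)
    return (
        f"SELECT {columns}\n"
        f"FROM {s3_table} AS s\n"
        f'JOIN "{federated_catalog}".{federated_table} AS f\n'
        f"  ON s.{join_key} = f.{join_key}"
    )


def ctas_to_s3(new_table, select_sql, external_location, data_format="PARQUET"):
    """Wrap a SELECT in CREATE TABLE AS so the result set is written to S3."""
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = '{data_format}', external_location = '{external_location}')\n"
        f"AS\n{select_sql}"
    )


# Quick analysis: orders live in S3, customers live in a federated database.
select_sql = federated_join_sql(
    select_columns=["s.order_id", "s.order_total", "f.customer_name"],
    s3_table="orders",
    federated_catalog="mysql_catalog",
    federated_table="sales.customers",
    join_key="customer_id",
)

# Extract and transform: persist the joined result in S3 for later analysis.
ctas_sql = ctas_to_s3("orders_enriched", select_sql, "s3://example-bucket/enriched/")
```

Either statement can be submitted like any other Athena query, so the same SQL works for ad hoc analysis and for a scheduled job.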

Author: user
