Querying S3 with Presto
Dan Corin
April 11, 2016
This post assumes you have an AWS account and a running Presto instance (standalone or cluster). We'll use the Presto CLI to query the Yelp dataset, a JSON dump of a subset of Yelp's data covering businesses, reviews, check-ins, users and tips.
Configure Hive metastore
Configure the Hive metastore to point at our data in S3. We are using the Docker container inmobi/docker-hive.
Modify /usr/local/hadoop/etc/hadoop/core-site.xml and add the following so we can connect to S3:
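As a minimal sketch, assuming the data is addressed through the s3n:// filesystem, the properties could look like the following. The credential values are placeholders and the scratch file name is hypothetical; the two property names are the standard Hadoop settings for s3n.

```shell
# Write the S3 credential properties to a scratch file (hypothetical name).
# Paste its contents inside the <configuration> element of
# /usr/local/hadoop/etc/hadoop/core-site.xml.
cat > s3-credentials-fragment.xml <<'EOF'
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
EOF
```

If you address the bucket with s3a:// instead, the corresponding properties are fs.s3a.access.key and fs.s3a.secret.key.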
Run Hive and CREATE an EXTERNAL TABLE that points to S3. Note: supply the path to the S3 folder containing the .json file, not the path to the file itself. Here, we create a relational-like table out of the JSON, which we will unpack with Presto.
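One way to do this is sketched below, under a few assumptions: the table name yelp_reviews, the column name json_body, and the bucket path are all hypothetical, so substitute your own. Each line of the file lands in a single string column, which leaves the JSON parsing to Presto.

```shell
# Write the DDL to a file (table, column, and bucket names are hypothetical).
cat > create_yelp_reviews.hql <<'EOF'
-- One string column: every line of the JSON file becomes one row.
CREATE EXTERNAL TABLE IF NOT EXISTS yelp_reviews (json_body STRING)
STORED AS TEXTFILE
LOCATION 's3n://your-bucket/path/to/reviews/';
EOF

# Run the DDL if the hive binary is on the PATH (for example, inside the container).
if command -v hive >/dev/null 2>&1; then
  hive -f create_yelp_reviews.hql
else
  echo "hive not found; run 'hive -f create_yelp_reviews.hql' where Hive is installed" >&2
fi
```

Because the table is EXTERNAL, dropping it later removes only the metastore entry and leaves the data in S3 untouched.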
Configure Presto to read from Hive
Specify a properties file for Presto to use to connect to Hive.
hive.properties
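A minimal sketch of the file follows, assuming a Hadoop 2 based Hive installation and a metastore listening on the default Thrift port 9083. The host name and credential values are placeholders; adjust connector.name if your Hive distribution calls for a different connector.

```shell
# Create hive.properties in the current directory.
# The metastore host and the credentials below are placeholders.
cat > hive.properties <<'EOF'
connector.name=hive-hadoop2
hive.metastore.uri=thrift://your-metastore-host:9083
hive.s3.aws-access-key=YOUR_AWS_ACCESS_KEY_ID
hive.s3.aws-secret-key=YOUR_AWS_SECRET_ACCESS_KEY
EOF
```

The file name determines the catalog name, so a file called hive.properties is exposed in Presto as the hive catalog.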
Save and close this file and distribute it to the catalog folder of the coordinator and all workers. Then restart the coordinator and workers:
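On each node, the copy and restart might look like the sketch below. The installation directory /opt/presto is an assumption, so set PRESTO_HOME to wherever Presto lives on your machines; etc/catalog and bin/launcher are the standard locations within a Presto installation.

```shell
# Run on the coordinator and on every worker.
# /opt/presto is a hypothetical install location; override PRESTO_HOME as needed.
PRESTO_HOME="${PRESTO_HOME:-/opt/presto}"

if [ -x "$PRESTO_HOME/bin/launcher" ]; then
  # Install the catalog file, then restart the Presto server on this node.
  mkdir -p "$PRESTO_HOME/etc/catalog"
  cp hive.properties "$PRESTO_HOME/etc/catalog/hive.properties"
  "$PRESTO_HOME/bin/launcher" restart
  RESTART_STATUS="restarted"
else
  echo "No Presto launcher found under $PRESTO_HOME; set PRESTO_HOME and retry" >&2
  RESTART_STATUS="skipped"
fi
```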
Query S3 with Presto
Open the Presto shell on the coordinator:
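For example, assuming the CLI executable was downloaded as ./presto and the coordinator listens on localhost:8080 (both are assumptions about your setup), the shell can be opened against the hive catalog and its default schema like this:

```shell
# Path to the CLI and the coordinator address are assumptions; override as needed.
PRESTO_CLI="${PRESTO_CLI:-./presto}"
PRESTO_SERVER="${PRESTO_SERVER:-localhost:8080}"
PRESTO_CATALOG=hive
PRESTO_SCHEMA=default

if [ -x "$PRESTO_CLI" ]; then
  # Opens an interactive session; tables in hive.default can be queried by bare name.
  "$PRESTO_CLI" --server "$PRESTO_SERVER" --catalog "$PRESTO_CATALOG" --schema "$PRESTO_SCHEMA"
else
  echo "Presto CLI not found at $PRESTO_CLI" >&2
fi
```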
Let's find the reviews with the most "funny" votes in the dataset.
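A sketch of such a query is shown below. It assumes the Hive table is named yelp_reviews with a single string column json_body (hypothetical names), and that each review record carries a votes object with a funny count alongside review_id and text fields, as in the dataset at the time of writing. json_extract_scalar is Presto's function for pulling a scalar out of a JSON string by JSONPath.

```shell
# Write the query to a file (table and column names are hypothetical).
cat > funny_reviews.sql <<'EOF'
SELECT
  json_extract_scalar(json_body, '$.review_id') AS review_id,
  CAST(json_extract_scalar(json_body, '$.votes.funny') AS BIGINT) AS funny_votes,
  json_extract_scalar(json_body, '$.text') AS review_text
FROM yelp_reviews
ORDER BY funny_votes DESC
LIMIT 10;
EOF

# Run it non-interactively if the CLI is available (path and server are assumptions).
PRESTO_CLI="${PRESTO_CLI:-./presto}"
if [ -x "$PRESTO_CLI" ]; then
  "$PRESTO_CLI" --server "${PRESTO_SERVER:-localhost:8080}" \
    --catalog hive --schema default --file funny_reviews.sql
else
  echo "Presto CLI not found at $PRESTO_CLI; paste funny_reviews.sql into the Presto shell instead" >&2
fi
```

The CAST matters here: json_extract_scalar returns a varchar, and without the conversion the ordering would be lexicographic rather than numeric.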
This should give a nice intro to querying S3 and using some of Presto's tools to work with JSON.