{
"path": "/posts/2016/2016-04-11-querying-s3-with-presto",
"site": "at://did:plc:mracrip6qu3vw46nbewg44sm/site.standard.publication/self",
"tags": [
"code",
"presto",
"s3"
],
"$type": "site.standard.document",
"title": "Querying S3 with Presto",
"updatedAt": "2016-04-11T17:19:00.000Z",
"publishedAt": "2016-04-11T17:19:00.000Z",
"textContent": "Querying S3 with Presto\n\nThis post assumes you have an AWS account and a running Presto instance (standalone or cluster). We'll use the Presto CLI to run queries against the Yelp dataset, a JSON dump of a subset of Yelp's data for businesses, reviews, checkins, users and tips.\n\nConfigure Hive metastore\n\nConfigure the Hive metastore to point at our data in S3. We are using the Docker container inmobi/docker-hive.\n\nModify /usr/local/hadoop/etc/hadoop/core-site.xml and add the following so we can connect to S3:\n\nRun Hive and CREATE an EXTERNAL TABLE that points to S3. Note: supply the path to the S3 folder containing the .json file. Here, we create a relational-like table out of the JSON, which we will unpack with Presto.\n\nConfigure Presto to read from Hive\n\nCreate a properties file that Presto will use to connect to Hive.\n\nhive.properties\n\nSave and close this file and distribute it to the catalog folder of the coordinator and all workers. Then restart the coordinator and workers:\n\nQuery S3 with Presto\n\nOpen the Presto shell on the coordinator:\n\nLet's find the reviews with the most \"funny\" votes in the dataset.\n\nThis should give a nice intro to querying S3 and using some of Presto's tools to work with JSON.",
"canonicalUrl": "https://www.danielcorin.com/posts/2016/2016-04-11-querying-s3-with-presto"
}