This looks very neat. I'm someone who deals with a lot of plaintext data from a variety of sources, and so I find using ack/grep and csvkit to be efficient enough for my purposes of exploration. I love using SQL and SQLite but rarely do it for "fun" -- that is, I'll use it when I've committed to building a project, but not for exploration. This seems like it could lighten the friction quite a bit.
If anyone from AWS is here: how is this used internally at Amazon?
The real question to ask is, will Amazon contribute back to open source? Presto itself is plenty proven and scalable: after all, it was created at Facebook.
It looks really interesting but I'm surprised they launched it with the create table flow broken. The query you see here was generated by their wizard...
I don't know why you are getting downvoted. For all those data formats you have to painstakingly make table schemas for them before you can query them. Not like Snowflake or BigQuery. One of the biggest strikes against Presto IMHO.
It's not Presto per se, but running any data processing workload against unoptimized data formats is the issue.
Then again, both BigQuey and Snowflake require that you move data into their storage engine (Redshift too), and that's an additional step that's proportional to the size and complexity of your data. At the same time, it's stupid to store your logs as OLAP optimized formats and completely lose legibility. In sum, Athena trades off performance for convenience.
No matter what database vendors say, you can't defy the principles of computer science.
You don't replace them with an OLAP format, you can pair them with an OLAP engine to aggregate, filter, or analyze. Elastic Search and Splunk are one approach, SQL query engines are another.
Apache Drill is a schema discovery on read approach that can handle some of this. Its not perfect, but it does simplify some of the process where its capabilities fit the task at hand.
TFA states: "Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, and Parquet."
> Q: What data formats does Amazon Athena support?
> Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or Textfiles and also supports open source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, and GZIP formats. By compressing, partitioning, and using columnar formats you can improve performance and reduce your costs.
Would be useful if AVRO files were supported. This was the data can also be imported into Redshift if needed (Redshift does support Avro).
Other formats are schema-less (JSON,CSV, etc.) or not supported by Redshift (ORC, Parquet). Perhaps less efficient for some queries (AVRO is not a columnar format) but still useful.
How does that work though, so say my bucket has 10,000 json files in it and I want all of them with the name attributes being like '%john'. Is that possible?
Just like with BigQuery, a carefully thought out partitioning scheme is critical, or your queries need to be carefully locked down to prevent excessive table scanning. I burned through my BigQuery trial credit fast, by not using partitions during a quick-and-dirty test.
Wondering if I could use this like SQLite for Lambdas. I'd like to build some serverless apps, but the commitment to a monthly fee from DynamoDB puts me off. Could I use Athena to drive down my cost to zero as long as the app is unused?
Note the 10MB minimum "charge" per Query. For small datasets under 10MB, you'd only get up to 200 Queries for the minimum billable $0.01. That would be a fairly small number of queries, so probably not that useful. Plus you'd have all kinds of issues regarding consistency if your data was dynamic (s3 is a blob store, not a database, normal s3 consistency guarantees still apply).
I'm confused though. The monthly fees for dynamodb only apply after you exceed the free tier, and for someone who is unable to commit to a monthly fee because they envision low usage, shouldn't the free tier be sufficient? (Honest question, I'm looking at using dynamodb, but comments like this make me think I'm missing something)
DynamoDB is like $5 or $10 bucks a month? but I understand the need to keep it to a minimum.
Athena is really interesting and if it can be as it is advertised "Serverless SQL" then they've got a killer product in the pipes: A future where developers no longer need to spend time on scaling, configuring, maintaining, strategizing deployments but upload code and instantly begin reaping the benefits of serverless tech.
The only missing component that would be a killer feature is something that answers to Azure's Active Directory. It would be nice if we had serverless plug-and-play user authentication and access control that integrated with Lambda and Athena.
I'd imagine some sort of "RoR on Serverless" type of framework that will scaffold out CRUD, User Management & REST Api is going to be in the works as well.
The only potential downside I see at the moment for Serverless is the uncertainty surrounding cold boots, it will directly affect user experience. It's fine when you got enough traffic to keep things in the "warm" state but there needs to be no dead zone when the call to the API Gateway is taking many seconds waiting for Lambda function to fire.
Re: users auth. Isn't that what Cognito is supposed to be? I mean, I don't fully understand it, but I think so.
As for the cold boot issue, I thought the standing solution was to have a "fast-exit" ping-like code-path within the lambda. Query it on a regular basis (you can even do it with a lambda scheduled-event). That way your lambda should be kept warm.
What's your usecase for Athena (Really curious how this changes anything)?
You can hook your lambda up to a cloudwatch scheduled event (http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/Sc...). If you don't do much when this happens (like, you exit immediately), then you won't be charged much compute time. Can't find the reference to how often you should ping it to keep it warm, probably varies, IIRC hourly would be plenty.
building user management, security has always been a "build your own wheel". While I think a properly secured SQL table with encrypted keywords is certainly possible the risk remains. We see even for prolific and large websites their entire user tables are dumped online on a regular basis.
Off-loading this burden on a cloud vendor whose sales are directly tied to the security, gives me peace of mind. Ashley Madison had no financial incentive to keep their user base as secure as possible or actively defend against it using R&D. Yahoo Mail as well. But AWS and Active Directory are products that sell this security and for them to fuck up on an epic scale would mean their end.
so the tldr is piggybacking on the tremendous resources from a cloud giant which frees up resources to focus on the core product.
This is not to say that it's impossible to roll out your own wheel, I'm just saying it makes more sense to align the financial incentives of vendors maintaining the user base security. This may or may not mean I'm open to focusing on companies solely focused on user auth/management products, it's tough to beat the branding and trust built into AWS & Azure that developers are voting everyday with their money & data.
My solution currently is : Aws API gateway + Cognito + lambda + DynamoDB for webservice. S3 for html/css/js. CodeDeploy + Cloudformation for deployment.
Athena does not gurantee for timing, use it for async call and offline processing
Not OP, but athena returns results for most queries in a couple of seconds (quote is somewhere in the blogpost) this would likely not be enough for your typical request/response flows.
To me the obvious use case is querying your log files as stored on s3. Query for a specific combination of features, or do some (simple) processing on them.
It's really only useful for a small list of file formats. Doesn't really do much for you if you primarily use s3 for binary data or static web hosting.
If anyone from AWS is here: how is this used internally at Amazon?