The Data Cloud Podcast

Product Management for Data Lake, Open Source, and Storage at Snowflake with James Malone, Senior Manager of Product Management, Snowflake

Episode Summary

In this episode, James Malone, Senior Manager of Product Management at Snowflake, gives us a look behind the curtain at Snowflake. He talks about the perks of open-source, meeting customers where they are and so much more.

Episode Notes

--------

How you approach data will define what’s possible for your organization. Data engineers, data scientists, application developers, and a host of other data professionals who depend on the Snowflake Data Cloud continue to thrive thanks to a decade of technology breakthroughs. But that journey is only the beginning.

Attend Snowflake Summit 2023 in Las Vegas June 26-29 to learn how to access, build, and monetize data, tools, models, and applications in ways that were previously unimaginable. Enable seamless alignment and collaboration across these crucial functions in the Data Cloud to transform nearly every aspect of your organization.
Learn more and register at www.snowflake.com/summit

Episode Transcription

[00:00:00]

Steve Hamm: Well, welcome, James. Uh, would you please start by telling us a bit about yourself, your career in the tech industry, and your role at Snowflake?

James Malone: Absolutely. It's great to be here. Thank you for having me. So I'm James. Uh, I lead a team of really awesome product managers at Snowflake, and we focus on all of the technologies that help customers get data in and out of Snowflake, orchestrate, uh, things inside of Snowflake, alert on things in Snowflake, and store data, either inside of Snowflake or outside in, uh, the public cloud or on premises. I came to Snowflake about a year and a half ago.

Previous to that, I was at Google for a number of years and launched several of their publicly available cloud services. Generally, I tended to focus on either, uh, third-party or white-labeled products, or products which were based on managed open source. Uh, previous to [00:01:00] that, I was at Disney for a few years and led data engineering for the media side of Disney.

And, uh, prior to that, I was at Amazon for a number of years, where I had started my career and then went through a number of different, uh, positions, from, uh, engineering through technical product management.

Steve Hamm: Well, your resume is very cool. We've gotta say that, you

James Malone: Thank you.

Steve Hamm: All right. So, um, well, great. You know, most of our podcasts cast a wide net. You know, we reach out to a range of people and roles, anywhere from business executives on the one end to, uh, data technologists on the other. Today, though, we're focusing in on a critical issue for many people in the tech community, which is managing data that's not stored in Snowflake, but might be stored in data lakes or other kinds of repositories.

So tell us why this is such a critical issue today. [00:02:00]

James Malone: That's an excellent question. So "not in Snowflake" can take many different forms. It can take the form of being stored outside of the public cloud or inside of the public cloud. The outside-of-the-public-cloud space is a little bit more straightforward.

Outside of the public cloud, we have the ability with external tables now to communicate with devices which have an S3 API on top. That generally makes data that's stored outside of the public cloud accessible to Snowflake. And we've seen a lot of customer interest and early use of that product, because there are some cases where customers just can't or won't move data to the public cloud.

Most of the data that is not inside of Snowflake, however, is inside of the public cloud. And I think, in my opinion, this boils down to the cloud blob stores or object stores: Amazon S3, Google Cloud [00:03:00] Storage, Azure Storage. They make it very easy to store vast amounts of data at a fairly low price.

And that has spawned a whole ecosystem of products and tools, some from the cloud providers, some software as a service, and generally just a lot of manual movement, plus the whole ecosystem of tools that build on top of that paradigm of really reliable, really fast, easy, and cost-effective storage.

And as a result, we see a lot of customers squirreling away tremendous amounts of data inside of those object stores. Typically, customers have moved a portion of data to Snowflake from those object stores, but a lot has remained outside of Snowflake, either because of cost concerns, or they didn't see an immediate, uh, use case, or, uh, it was data that wasn't as structured as they would expect to use with Snowflake.

So we are working on a whole set of tools [00:04:00] and technologies and solutions to make Snowflake, uh, much easier to use with that vast quantity of information that happens to reside in the public cloud.

Steve Hamm: Very good. So for our business listeners, let's break some things down here. Let's start with some definitions. Here's a little list of things I'd like you to define: object stores, table format, data lake. And you mentioned blob storage before. What's

James Malone: Yeah. Okay. Those are, those are good questions. So let's start with object stores or blob storage. Generally, I probably shouldn't always, but I'm gonna use those terms interchangeably. What we're referring to there are generally the fundamental storage platforms in most public clouds. So Amazon S3, Google Cloud Storage, or Azure Data Lake Storage: they allow you to store blobs, or sets of data, very cost effectively inside of the public cloud.

And I would say it's [00:05:00] most common for most people to store their data lakes or their data meshes inside of those storage technologies. So they're designed to just allow you to store a huge amount of data, very cost effectively. A data lake is really, it's a storage architecture paradigm. Um, and it's not the only one you could go build a data mesh, uh, or a data warehouse.

And I think you see, as technology changes, there's been the evolution of different storage architectures. And without going too far down into the weeds, I will just throw out that it's pretty common for people to structure their data based on their organization. And I think that's why you see a fluidity of these architectures appear, you know, rise and fall over time, and some purport to be the end-all, be-all or the best.

I'm not sure that's actually true. It's that different people pick different organizational formats or different architectures based on how their own organization is structured. [00:06:00] A table format allows you to specify a table on a set of files that are inside of these blob stores. To give you an example, if you have a thousand Parquet files inside of Amazon S3, uh, most tools are unaware whether you have a thousand tables, or one table that has a thousand files, or two tables that are 500 files each.

A table format allows you to easily map those files to a table in terms of its schema, its, uh, partitioning, its changes or schema changes over time. And most commonly, people are using those table formats to implement a data lake or a data mesh.
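Editor's note: the thousand-files example above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not the real Iceberg metadata layout; the bucket path, the table names (orders, clicks), the column names, and the TableMetadata class are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Toy stand-in for table-format metadata (not the real Iceberg layout)."""
    name: str
    schema: dict[str, str]      # column name -> type name
    partition_by: str           # column the table is partitioned on
    data_files: list[str] = field(default_factory=list)


# Hypothetical object-store listing: a flat pile of 1,000 Parquet files.
listing = [f"s3://example-bucket/data/part-{i:04d}.parquet" for i in range(1000)]

# The listing alone cannot say whether this is 1 table or 1,000 tables.
# The metadata makes the split explicit: here, two tables of 500 files each.
orders = TableMetadata("orders", {"order_id": "long", "day": "date"}, "day", listing[:500])
clicks = TableMetadata("clicks", {"user_id": "long", "day": "date"}, "day", listing[500:])

catalog = {table.name: table for table in (orders, clicks)}


def files_for(table_name: str) -> list[str]:
    """Return the data files a query engine should read for one table."""
    return catalog[table_name].data_files
```

Calling files_for("orders") returns only the 500 files assigned to that table, which is the piece of information a bare file listing cannot provide.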

Steve Hamm: Okay. Okay. But let me ask you this. I have the impression that a data lake and a data warehouse, these are for different kinds of data: that a data warehouse is really for more structured data, for relational data, and a data lake maybe for semi-structured and unstructured. Is that true, or is that just [00:07:00] an oversimplification?

James Malone: I think historically that has been true. And historically, I think that has been a very good measure. As they often do with technology, the lines are blurring. To give you a good example, uh, with Snowflake, we support unstructured data. And I think in a lot of customers' minds, that really does make it possible to do data analytics, or to derive semi-structured or structured data from unstructured data.

And where a data warehouse begins and a data lake ends is not always a clear line of delineation. Uh, likewise, the whole rise of table formats like Apache Iceberg is based on the need to do structured or semi-structured analytics on top of data inside of blob stores. So I would say the lines are increasingly blurring over time.

Steve Hamm: Okay, good. Good. Um, all right. So you mentioned earlier about [00:08:00] external tables. Three years ago, Snowflake released external tables. And this is the technology that made it easier to manage, engineer, and analyze data that's not stored in Snowflake. But this year the company went a step further.

It announced technology to support the Apache Software Foundation's Iceberg. And you also mentioned that. Tell us what Iceberg is and what its benefits are.

James Malone: Absolutely. So let's talk about Iceberg. So a good example of this challenge of working with a ton of files inside of something like Amazon S3 comes from Netflix, who, uh, invented and then open sourced and released Iceberg. They were generating lots of data, ever-increasing amounts of data, and they wanted to do high-performance analytics on top of that data.

And it was getting more and more challenging as those data volumes grew. So Iceberg was their design, which has now for several years lived inside of the Apache Software Foundation: [00:09:00] uh, a mechanism to make it easier to do analytics on top of those, those files. So it defines what a table looks like, where a table is, how many files belong to a table, and if the table schema has changed over time, and it allows you to repartition a table.

And it wasn't the first table format, and it may not be the, the last table format. It probably won't be, but it has proved to be a highly adaptable and fairly robust design. And I'll give you a few examples. Iceberg is interesting.

Uh, it's interesting to Snowflake, and it's also interesting to me personally, because it is not engine specific. It's not tied to something like Spark, where other table formats are. It's also not file format specific. It supports three file formats, and that list will probably grow over time.

So the underpinnings of Iceberg were vendor agnostic and future looking. [00:10:00] And because of that, I think there's been tremendous momentum, both in the commercial space and also in the customer interest and adoption space, around Iceberg. Commercially, a number of cloud vendors and SaaS vendors have staked on Iceberg, and Snowflake is one of them, and cloud vendors have done the same.

And, uh, truly, we've even been a little surprised by how interested customers have gotten, uh, around Iceberg over the last year.
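Editor's note: the idea that a table format records which files belong to a table and how its schema has changed over time can be sketched as a small versioned log. This is a simplified toy model under our own assumptions, not Iceberg's actual metadata structures; the Snapshot and ToyTable classes and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of a table (toy model, hypothetical fields)."""
    snapshot_id: int
    schema: tuple[str, ...]
    data_files: tuple[str, ...]


class ToyTable:
    """Keeps every version of the table, so history stays inspectable."""

    def __init__(self, schema, data_files=()):
        self.snapshots = [Snapshot(1, tuple(schema), tuple(data_files))]

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def _commit(self, schema, data_files) -> None:
        new_id = self.current.snapshot_id + 1
        self.snapshots.append(Snapshot(new_id, tuple(schema), tuple(data_files)))

    def append_files(self, *paths: str) -> None:
        # New data arrives as new files; the schema is carried forward.
        self._commit(self.current.schema, self.current.data_files + paths)

    def add_column(self, name: str) -> None:
        # In this toy model a schema change is a metadata-only commit:
        # the list of data files is carried forward unchanged.
        self._commit(self.current.schema + (name,), self.current.data_files)


table = ToyTable(["id", "title"], ["f1.parquet"])
table.append_files("f2.parquet")
table.add_column("rating")
```

After these three steps the table has three snapshots: the current one lists both files and the new column, while the first snapshot still shows the original two-column schema.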

Steve Hamm: Yeah, that's interesting. So has it really just been soaring in the last year or so, or did it have kind of a, a long takeoff before that?

James Malone: It, it had a longer takeoff, but over the past year, I would say it's following a fairly exponential growth at this point. And clearly that exponential growth can't necessarily continue forever, but the growth is best modeled exponentially right now. Over the last year, we've seen an inversion. The best way to look at [00:11:00] this is whether we, Snowflake, or somebody else is, uh, bringing up Iceberg.

And a year ago, we were leading a lot of the conversations on Iceberg to gather interest and customer feedback. Today, nine times out of 10, it's customers who are bringing it up organically and are asking Snowflake either whether we support Iceberg or what our plans to support Iceberg are.

So we've seen just a huge shift in the mindset of customers specifically over the last year. And I think a lot of that ties to the really rapid and widespread adoption of and interest in Iceberg outside of Snowflake.

Steve Hamm: Yeah, it's interesting to see, 'cause obviously this is an open source technology, and throughout the tech industry, for years now, we've seen kind of a hybrid mishmash, a combination of open source and proprietary, uh, technologies. And it seems like, you know, this is really the way the world is working these days.

So tell us, why did Snowflake decide to take this extra step [00:12:00] and really support this open source technology so, so strongly?

James Malone: That's a good question. I would say the best way to look at this is, uh, many customers, especially large customers, tend to have open source at some point in their data stack. It could be on the storage layer. It could be on the query or engine layer. It could be on the management layer, or it could be a combination thereof.

And these customers like open source for a whole lot of reasons. And what we want to do is incorporate and meet those customers where they are and meet their needs. So from the Snowflake point of view, if something that is open source is working for a customer, we want to meet the customer where they are and incorporate that open source stack, uh, into Iceberg, excuse me, into Snowflake as a larger [00:13:00] platform.

And, uh, I think we see a lot of open source often fitting kind of the leading use cases for customers. So if a customer has a need, one of the first things we see customers do, uh, especially for the more advanced needs, is to see what might be out there that they could quickly use to solve a problem.

Uh, and we wanna make sure that customers can bring and incorporate all of their use cases into Snowflake.

Steve Hamm: Yeah, it's interesting. Like, years ago, there was a lot of concern about kind of vendor lock-in, you know, like whether, you know, places would be kind of stuck with Microsoft or Oracle or something like that. But it seems like this, this intermingling of open source has really broken that a bit.

Right. Is that, is that, accurate?

James Malone: I, I think it is. And, and this is what we see customers doing, and generally what we've always recommended customers do. Open source is not inherently good or bad. It can solve a lot of good use cases. It's all about choosing open [00:14:00] source, and we phrase it as choosing it wisely, but choosing open source to meet a specific need where it makes sense.

I think, from my perspective, there isn't today, and there probably never will be, a one-size-fits-all solution. And we see a lot of customers that are choosing open source wisely to defend against lock-in, or to meet specific use cases, or both. And I think it has, uh, made the concerns over lock-in, uh, generally much less of a forefront thought for many customers.

Steve Hamm: Yeah. Now, as long as we're talking about open source stuff, and you mentioned this a moment ago: Apache Spark, yet another very important technology, a very important open source technology that is kind of, uh, governed by the Apache Software Foundation. Uh, um, so, I mean, Iceberg and Apache Spark, they don't do the same things, but it, it seems like they relate to each other in significant ways.

So [00:15:00] please explain what's going on here. Tell us what's going on with Spark, and how do these two technologies relate?

James Malone: That's a good question. So, um, a table format itself, like Iceberg, tells you how the data is represented, what files belong to what table, and what your tables look like. You need some tool or some engine to go actually query the data once you have it set as a table. For many in the open source world, Spark is probably the most popular tool to go out and, uh, process, uh, or run queries against data in open table formats.

So they relate together insofar as Spark is a common engine to go do analytics on open source table formats. That's the easy answer. The more complicated answer is other table formats [00:16:00] essentially require Spark to function. And that's not always a, uh, a clear outcome to many customers. Um, you know, and Spark is near and dear to me. When I actually joined Google, it was specifically to launch their managed Spark service, uh, at this point, you know, seven or eight years ago.

And, uh, I think what has become more challenging, I think the, the pause we're seeing on Spark, to sort of round out the question, is, um, a lot of vendors have tried adding their own flavor of Spark. So in a way, the open source project has been packaged and kind of sold in a way that is, in some sense, closing it off.

And it's actually causing a lot of confusion for customers.

Steve Hamm: Is that called forking or no?

James Malone: Yeah, it's almost, yes. So instead of actually, you know, declaratively saying, we're gonna take this project and make it our own and offer our own version, what instead we see happening is, um, there's different runtimes or different optimizations, which means just [00:17:00] because it is Spark, it doesn't necessarily run the same way.

You might not even get the same results when you run the same query. And that has, I think, caused some pause on at least Spark as an engine, and there's other engines now that we've seen much more interest in, so things like Trino. Uh, but as for the table format, like Iceberg, since Spark is only the engine on top, we've not seen that confusion, uh, start to trickle down to the table format.
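Editor's note: the separation described above, where the table format says which files make up a table and an engine sits on top to query them, can be sketched with two interchangeable toy engines reading the same metadata. Everything here is an assumption for illustration: the in-memory FILES and TABLE structures stand in for Parquet objects and table metadata, and neither function resembles how Spark, Trino, or Snowflake actually work.

```python
import sqlite3

# Toy "data files": in a real lake these would be Parquet objects in a blob store.
FILES = {
    "f1": [("a", 10), ("b", 20)],
    "f2": [("a", 5)],
    "orphan": [("z", 999)],  # sits in storage but is not part of the table
}

# Toy table metadata: the format, not the engine, says which files count.
TABLE = {"columns": ("key", "amount"), "data_files": ["f1", "f2"]}


def scan(table):
    """Yield rows from only the files the table metadata lists."""
    for name in table["data_files"]:
        yield from FILES[name]


def engine_python(table):
    """Engine 1: aggregate with a plain Python loop."""
    totals = {}
    for key, amount in scan(table):
        totals[key] = totals.get(key, 0) + amount
    return totals


def engine_sql(table):
    """Engine 2: load the same rows into in-memory SQLite and aggregate in SQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (key TEXT, amount INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", scan(table))
    rows = con.execute("SELECT key, SUM(amount) FROM t GROUP BY key").fetchall()
    con.close()
    return dict(rows)
```

Both engines return the same totals because they agree on the table definition, and neither reads the orphan file. Swapping the engine does not require touching the metadata.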

Steve Hamm: Oh, okay. Very good. All right. So like Snowflake, a lot of other companies within the modern data stack are adapting, or are creating their own strategy and approach around Iceberg, but they're not all the same. So help us understand how Snowflake's approach differs from some of the others.

James Malone: So the way I most often explain this is there's [00:18:00] two big sets of benefits that Snowflake is trying to bring to the table with Iceberg. And a lot of other strategies that we've seen thus far in the commercial market solve for one of these, but not both at the same time like, like Snowflake does. And the first big bucket of benefits is our, our query engine.

Our query engine is highly optimized, very performant, and without getting too technical, our query engine can even do things that other table formats, uh, and also other query engines can't do. So a good example is we can do multi-table transactions on top of Iceberg tables, or join Iceberg tables with native Snowflake tables, uh, without any additional work.

Um, the, the query engine is a compelling value. Uh, I think the second big bucket of benefits that we're trying to bring to Iceberg is all of our platform benefits. [00:19:00] So this is things like encryption and replication and data governance and data sharing and search optimization and partitioning. All of those things are really hard to do in many platforms, especially with open source table formats.

Snowflake does all of those as part of its current offering. It's part of just being a data platform, and we can layer those benefits onto Iceberg fairly transparently. And in my view, that's a huge benefit to anybody using an open table format: to have those seamless platform benefits without, uh, breaking the table format or trying to privatize it in a way where it stops being open.

Steve Hamm: It's interesting. You've talked about all these efforts and all these things that Snowflake has done to kind of adapt to Iceberg and adapt to some of these other technologies for, for external data. Wouldn't it just be easier for these organizations to move all their data into [00:20:00] Snowflake?

James Malone: Honest answer: yes and no. Uh, I mean, you know, I think any commercial offering would love to have one solution that works for everybody. Uh, I think probably it's a truism for any commercial product: if you can design one thing that fits everybody, that's great.

Uh, reality does not make that possible. Um, so I think this goes to why we see customers using Iceberg and open formats in the first place. There's concerns around control. So some customers want to use open formats. They want to manage files. They don't wanna put, uh, something in storage that they can't see or access themselves.

And then third, customers, going back to the earlier point, do want to interoperate, or have Snowflake work with other tools. And I think trying to have customers just move everything into Snowflake would make one or more of those three points harder to navigate. So our solution is to design a brand new product from the ground up to squarely address all three of those needs.

So instead of moving the data into Snowflake, you can transparently apply Snowflake's platform and query engine on top of the open formats in the storage that you specify. And that's, that's why we designed Iceberg, Iceberg Tables, as a brand new product. That way we get out of this game of worrying about moving data back and forth.

And instead, Snowflake can just operate on data where it [00:22:00] is, in the open format that it currently is.

Steve Hamm: Yeah. Yeah, no, I, I think that I, I respect that approach. I mean, it is like, rather than demanding that customers fit into your mold, you're willing to be flexible and do whatever they need, and there are lots of different approaches. So I think that's an admirable strategy. Now, you talk to a lot of developers in your professional life, whether within Snowflake or within other organizations. What are they telling you about what they want to see changed in how data is located and managed?

James Malone: Yeah. There's some interesting trends that, that we see. I think one of the trends is people, and there's different personas, right? There might be data engineers. There might be SQL analysts. There might be ML engineers. So I'm speaking [00:23:00] broadly, trying to encapsulate as many of these personas or customers, um, into the mix as possible.

We generally see that customers want to do more without spending more. And I think that's one kind of truism. Two, it's still true that, even though they've incorporated more open source into their stacks, customers still don't want to be locked into any particular technology or decision.

And I think that often actually comes out in past experiences that customers have around, we chose this, and then we couldn't change our architecture for N number of years or months because we were essentially locked in. Um, and I think the, the third is customers want to make sure that wherever something is stored, whatever technology they're using, it is really interoperable with the tools that they might want to bring to the picture later, or that Snowflake is planning on bringing to the picture later.

So [00:24:00] those are the big trends that we see. And what it means for us is, again, we just need to work around where customers are at, instead of enforcing our view that our proprietary formats are great, and that when you don't have to manage files, things are much more secure.

Both of those happen to be true, but for a lot of customers, where they're at means that they do want to use object storage, they want to use open formats, and they want Snowflake to transparently layer on top of that. And we're trying to work around those, uh, those requests and those

Steve Hamm: Yeah. Yeah. You know, it's interesting. When I look at your career, you mentioned Disney, Amazon, Google, Snowflake. You know, these are some of the most important technology companies in the world, and you've, you've gone from one to another. You've spent some good chunks of time at each. I get a sense that you're kind of on a mission.

What, what's the thing that kind of takes you through on this pathway from one [00:25:00] company to another?

James Malone: That is a, that is a good and interesting question.

Steve Hamm: Yeah.

James Malone: I would say, in a lot of classic business thinking, there is a tension between something being open and flexible and something being proprietary and monetizable. This is not a new problem. And I tell a lot of product managers this, uh, and this is, I think, sort of, uh, how a lot of people view history: things often get repeated more often than they're new.

And the tension between this kind of open and flexible and commercial and proprietary in technology is certainly not new. You see it with, uh, you know, formats, you see it with, uh, computer architecture designs. You see it with communication technologies.

Steve Hamm: Right.

James Malone: I do think, however, that that dichotomy, that, that us versus them, or A or B, is a falsehood. I actually think it is entirely possible, [00:26:00] um, and profitable for a company, and, uh, very useful for a customer, to really blend those two. There's more success in the intersection or the, uh, the overlap between those two than there is in trying to ride the tension.

And I think when you try to ride the tension and you stay at one extreme or the other, I think your path to immediate success will be faster, but the potential for long-term success will be lower. I think when you ride the intersection between the two, your path to success will be slower, but the opportunity for longer-term success will be much, much greater.

Uh, because I think you're, you're really meeting the needs of where most consumers are going to be, which is, they pick your product because of some of its inherent unique qualities, but they also wanna make sure that your product works with many other products that they may be using or considering using.

Steve Hamm: Yeah. Yeah, no, I think that's a very good [00:27:00] point. Um, I like people who are on a mission. I, I feel like, you know, I think you've done a really good job of explaining pretty complicated stuff on a level that even I can understand. So I, I really appreciate how you've operated today. Um, and I love what you said about, about that tension between open source and proprietary, and how it really shouldn't be a, a fight. It's really about a collaboration.

And that intersection is where the success is. And I, I think that is a lesson that a lot of tech companies and, you know, non-tech companies, the users of technology, could really benefit from. So thank you so much for talking to us today.

James Malone: Thank you very much for having me.