2944: What is a Lakehouse, and Why is it the Next Big Thing in Data?
Tech Talks Daily, June 26, 2024
19:18 | 15.45 MB


Have you ever wondered what the future of data management looks like? In this episode, we dive into the world of data lakehouses with Ori Rafael, the CEO and co-founder of Upsolver. Ori shares his insights on why the lakehouse is poised to be the next big thing in data, and how Upsolver is at the forefront of this revolutionary architecture.

A data lakehouse is not just a buzzword; it's a transformative approach that decouples storage, metadata management, and compute. Ori explains how this separation allows for greater flexibility and significant cost savings compared to traditional data warehouses. By leveraging object storage like S3, open-source Iceberg for metadata, and various compute engines, lakehouses reduce vendor lock-in and provide the ability to use specialized engines for different workloads, such as AI.
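The separation of storage, metadata, and compute described above can be sketched in a few lines of Python. This is a toy model only: the dictionary standing in for S3, the `ToyCatalog` class, and the `scan_table` function are illustrative inventions, not any real lakehouse or Iceberg API.

```python
# Toy sketch of a decoupled lakehouse -- NOT real lakehouse code.
# Storage holds raw files, a catalog holds table metadata, and any number
# of "compute engines" can read the same data through the catalog.

storage = {  # stands in for object storage such as S3: path -> rows
    "s3://bucket/events/part-0.json": [{"user": "a", "clicks": 3}],
    "s3://bucket/events/part-1.json": [{"user": "b", "clicks": 5}],
}

class ToyCatalog:
    """Stands in for a table-format catalog (e.g. an Iceberg catalog):
    it maps table names to the files that currently make up each table."""
    def __init__(self):
        self.tables = {}

    def register(self, table, paths):
        self.tables[table] = list(paths)

    def files_for(self, table):
        return self.tables[table]

def scan_table(catalog, table):
    """Any engine -- BI, AI, ad hoc SQL -- can implement its own scan
    against the same storage and the same catalog."""
    rows = []
    for path in catalog.files_for(table):
        rows.extend(storage[path])
    return rows

catalog = ToyCatalog()
catalog.register("events", storage.keys())
total_clicks = sum(r["clicks"] for r in scan_table(catalog, "events"))
print(total_clicks)  # 8
```

Because the catalog, not the engine, owns the table definition, swapping one query engine for another in this sketch means writing a new `scan_table`, not moving any data — which is the flexibility the lakehouse argument rests on.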

We explore the key advantages of lakehouses, including cost reduction, flexibility, and avoiding vendor lock-in. However, transitioning to a lakehouse architecture is not without its challenges. Ensuring performance parity with data warehouses and managing data access controls are significant hurdles. Ori discusses how Upsolver is tackling these challenges head-on, providing ETL solutions and lake management capabilities that optimize data lakes for performance and interoperability.

The episode also delves into the trends shaping the future of data management. With the rapid adoption of open lakehouses and Iceberg emerging as the standard, enterprises are moving away from traditional data warehouses and legacy data lakes. Ori provides a glimpse into how open source catalogs with governance capabilities are evolving, paving the way for more robust and scalable data management solutions.

We wrap up the conversation by asking Ori a fun question: If he could have a private breakfast or lunch with anyone in the business, VC funding, or tech world, who would it be and why? You never know, the person he mentions might just be listening!

Join us for this insightful discussion on the future of data management and discover why the lakehouse is the next big thing in the industry. Be sure to find out more about Upsolver and their innovative solutions by visiting their website or connecting with their team online.

[00:00:01] What is it that makes the lake house architecture the next big thing in data management? And can it really revolutionize the way that we handle data? Well today on Tech Talks Daily I'm joined by Ori Rafael, CEO and co-founder

[00:00:17] of a company called Upsolver. And my guest today is going to guide us through the intricacies of data lake houses and highlight why this new architecture is gaining traction over traditional data warehouses. We'll do it all in a

[00:00:30] language everyone can understand and we'll also explore those key advantages of lake houses, the challenges that come with transitioning to that architecture and how data ingestion differs in lake houses too. So buckle up and hold on

[00:00:45] tight as I beam your ears all the way to California where we attempt to uncover the potential of lake houses and Upsolver's role in enabling an innovative approach to data management today. So a massive warm welcome to the show. Can you

[00:01:01] tell everyone listening a little about who you are and what you do? I am Ori Rafael, I'm the CEO and co-founder of Upsolver. Well it's a pleasure to have you join me on the podcast and every single day on this show what I try and

[00:01:15] do is demystify technology and things that business leaders might have heard about but maybe find them a little bit daunting and overwhelming and a little too complex. So to set the scene for our conversation today can you explain what

[00:01:29] a lake house is and why it's considered the next big thing in data management? So the lake house is the deconstructed warehouse. What does it mean deconstructed? It means that you're taking the main components which are storage, metadata

[00:01:45] management, which is done in a catalog, and queries, and basically separating them into different pieces. So if I'm a customer and I want to use the lake house I can keep my storage just on S3 and manage my metadata in an

[00:02:00] Iceberg catalog and then I can basically decide which warehouse engine I want to run on top of that data. But the idea of the lake house is basically to free companies from warehouse vendor lock-in, which some people are thinking

[00:02:14] about well this can save me some money but other than just saving some money it can also allow you to use a multitude of tools and for example if you want to use

[00:02:22] a better tool for AI than your current warehouse you are able to do that. So it supports the AI movement very well and it reduces the cost of the warehouse by quite a lot. And for any business leader listening, hearing about this for the first time

[00:02:39] what would you say are the main business advantages of a lake house compared to let's say those traditional data warehouses? The number one would be cost. So when I'm going to write into a warehouse I

[00:02:53] basically need to pay for an ETL that's writing into the warehouse and I need to pay for a warehouse cluster that needs to be up in order for me to write the data. So once you are writing data in a way that is disconnected from the

[00:03:05] actual warehouse, you're basically building a file system and you only need the ETL piece. So you're saving all the money that you used to spend on keeping warehouse clusters up for ingestion. That would be cost reduction number one.

[00:03:19] Cost reduction number two would be the fact that you keep fewer copies of your raw data. So many companies have a lake where they keep all their raw data and then they keep another copy of the exact same data in the warehouse so that data

[00:03:33] can actually be queryable. Once you're basically building a lake house you're unifying those two copies: your staging area becomes your querying area, so you have one less copy of all your data. That's cost reduction number two, and cost

[00:03:47] reduction number three is the fact that in a warehouse, every time you want to do an update or delete, that's a very heavy operation. They're using a method called copy on write, which means that you need to do a full table scan every time you

[00:03:59] are going to do an update or delete. In the lake house, using formats like Iceberg, you're able to apply a method called merge on read, which means that you're not doing a full table scan every time you're doing an update or delete,

[00:04:13] so you can write in real time into the warehouse and not pay for constant full table scans which is a big problem for warehouse companies. But those three are immediate cost reductions you can add to that the fact that you have more leverage

[00:04:26] when talking to your warehouse vendor, because you're no longer locked in with that vendor; you can just work with another vendor on top of your data. So cost reduction is a big piece. There was a recent survey

[00:04:38] saying that enterprises are expecting to save more than 50% of their analytics cost by moving to an open lake house using formats like Iceberg and I think that's telling you why Iceberg is so hot at the moment. And reason number

[00:04:53] two is that the idea of one warehouse solving all your use cases doesn't hold, especially with everything happening in AI. I grew up in the Oracle world and I remember the feeling of trying to shoehorn the database into a use case

[00:05:08] that it's not a good fit for, and with the open lake house you're basically able to use one engine for BI, one engine for AI, and basically any other type of engine that you want for any additional use cases you want to do

[00:05:23] because your data is no longer locked. Those are the main advantages. And when you said the words cost reduction, followed by listing three ways of making immediate cost savings, I'm sure I could almost see light bulb moments going off all around

[00:05:37] the world but the question is when somebody embarks or an organization embarks on a journey like this what are the main challenges that organizations face when transitioning from a data warehouse to lake house architecture?
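As an aside for readers, the copy-on-write versus merge-on-read trade-off from the cost discussion above can be simulated in miniature. This is a deliberate simplification: real Iceberg tables use delete files and manifest metadata, not the plain tombstone list sketched here, and the function names are invented for illustration.

```python
# Toy simulation of the copy-on-write vs merge-on-read trade-off.
# NOT Iceberg's actual file layout -- just the shape of the idea.

def copy_on_write_delete(data_files, key):
    """Delete by rewriting every file that might contain the key:
    expensive at write time, nothing extra to do at read time."""
    return [[row for row in f if row["id"] != key] for f in data_files]

def merge_on_read_delete(delete_log, key):
    """Delete by appending a small tombstone record:
    cheap at write time, applied ('merged') when the data is read."""
    return delete_log + [key]

def read_with_merge(data_files, delete_log):
    """The reader merges the delete log against the untouched data files."""
    dead = set(delete_log)
    return [row for f in data_files for row in f if row["id"] not in dead]

files = [[{"id": 1}, {"id": 2}], [{"id": 3}]]

# Copy-on-write: every data file gets rewritten for a single delete.
cow_files = copy_on_write_delete(files, 2)

# Merge-on-read: the write is just one tombstone; the scan does the merge.
log = merge_on_read_delete([], 2)
mor_rows = read_with_merge(files, log)

print([r["id"] for r in mor_rows])  # [1, 3]
```

Both paths end up returning the same rows; the difference is where the work lands — at write time (copy-on-write) or at read time (merge-on-read) — which is why merge-on-read suits continuous, real-time ingestion.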

[00:05:50] Let's think about it from the perspective of the customer. It's not that they think, I'm just going to replace my warehouse storage layer and everything will be fine. They think, okay, what's

[00:06:03] going to happen to my queries? Are they going to work as fast as they are working on the warehouse? I think that's a main concern that I'm hearing from customers that the warehouses are doing a lot of optimization on their native

[00:06:17] storage layer to provide a certain level of performance. Will I get the same thing on top of Iceberg, and on top of a lake house in general? And I think question number two is how am I going to manage access to my data? So if the warehouse is

[00:06:34] not my catalog, what is my catalog, and how do I enforce access control to the data? Those are the two main questions, I think. And this is a much easier list compared to a warehouse

[00:06:51] migration because let's say I'm going to replace my warehouse A with warehouse B. In many cases I need to change the syntax and a lot of things need to change and it's a four year project to replace a warehouse. But in

[00:07:05] this case we are only replacing the underlying tables. I can continue using the same warehouse for queries so the query world is not being impacted. So it is a migration but it's a much easier one. I just need to create a new table

[00:07:20] and replace that with my old table. It's not a very hard thing to do but still requires a little bit of time. I'm curious how does data ingestion differ in a lake house environment in terms of things like scale, table management and

[00:07:34] storage systems when compared to those traditional warehouses? Well I think that the concept of ingestion is still similar with one big difference. When you're doing lake house ingestion I don't need a warehouse to be up. So

[00:07:50] let's say that you found something is wrong (this is happening to a lot of our customers right now). Like you've been writing bad data into your warehouse. You want to fix it. How are you going to fix it? You're going to delete

[00:08:02] the bad data from the warehouse and now you want, as fast as possible, to replay historical data into your warehouse so you'll have the missing data completed again. Now if you're going to do that, you need

[00:08:18] this huge warehouse cluster. You need to go and let your DBA change the cluster size. Now you're creating a lot of stress on the warehouse. So every time you write you basically need to think about the implications on the

[00:08:32] warehouse performance side and cost wise. When you're doing ingestion into the lake house there is no warehouse up. Your only system that needs to be available is Amazon S3 or any other type of object storage which is basically

[00:08:47] infinitely scalable. So the fact that you can do ingestion at any scale without thinking of what's going to happen to the warehouse is a very big advantage for the lake house approach. And you know what I said one thing but I'll add the

[00:09:01] second thing, which I previously mentioned: when you ingest into the lake house you can apply the merge on read technique and not just copy on write, meaning that you can get real time data into the warehouse or the

[00:09:15] lake house without paying a huge cost for doing that, without doing constant table scans, which is also an advantage on the ingestion side. It's incredibly cool what you're doing here and this feels like a perfect opportunity to discuss Upsolver and there will be people listening around

[00:09:32] the world that are hearing about you for the first time so could you tell me a little bit more about Upsolver the kind of problems that you're solving for them and maybe even provide a few examples of how Upsolver's technology

[00:09:44] is helping companies overcome some of the challenges that we've discussed today especially associated with implementing a lake house. Can you introduce everyone listening to Upsolver? Yes so I think that the easiest way to explain it to your listeners is that

[00:10:00] Upsolver is the combination of ETL meets lake management. So if I have an ETL solution that doesn't have lake management, that's great for the traditional warehouse. I just write the data. The warehouse is going to manage the file system.

[00:10:15] It's not something that I as the ETL vendor or the customer need to think about. But if I'm going to write into the lake house I am as the customer responsible for managing and optimizing that layer for performance.

[00:10:29] So now this is a huge burden that goes on the customer. ETL vendors don't have a solution for that because they don't manage the file system, but Upsolver does. So we both write or ingest into your tables but we also optimize the file

[00:10:44] system in a way that gives you interoperability. You can work with any warehouse that you want, and the performance that you're going to get is going to be similar to what the warehouse provides on its own native tables.

[00:10:58] Upsolver actually has separate products because of that. You can just use the ingestion, you can just use the lake management regardless of what you use for ETL, or you can use the combination of both.
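Ori's earlier point, that lakehouse ingestion only needs the object store to be available, can also be sketched in miniature. A temporary local directory stands in for S3 here, and the date-partitioned JSON layout is purely illustrative; real lakehouse ingestion would write Parquet data files plus table-format metadata such as Iceberg's.

```python
# Minimal sketch: lakehouse ingestion is just writing files to object
# storage, so no warehouse cluster has to be running. A temp directory
# stands in for S3; paths and layout are illustrative only.

import json
import tempfile
from pathlib import Path

def ingest(bucket: Path, table: str, day: str, rows: list) -> Path:
    """Append a batch of rows as a date-partitioned file. The only
    system that must be available is the object store itself."""
    part_dir = bucket / table / f"date={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    out = part_dir / f"part-{len(list(part_dir.iterdir()))}.json"
    out.write_text("\n".join(json.dumps(r) for r in rows))
    return out

bucket = Path(tempfile.mkdtemp())  # pretend this is s3://my-bucket

# Normal ingestion: write a batch for today's partition.
ingest(bucket, "events", "2024-06-26", [{"user": "a"}, {"user": "b"}])

# Historical replay is just writing more files -- no cluster resize needed.
ingest(bucket, "events", "2024-06-25", [{"user": "c"}])

written = sorted(p.relative_to(bucket).as_posix() for p in bucket.rglob("*.json"))
print(written)
```

The replay scenario from the conversation falls out naturally: backfilling a bad day is one more `ingest` call at whatever parallelism the object store absorbs, with no warehouse to stress.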

[00:11:09] So I'm curious because you're right in that part of all this you'll see so many big changes especially with the emergence of AI over the last couple of years and how everyone's gone crazy about that.

[00:11:20] So what trends do you see emerging in the data management space and how are you at Upsolver positioned to address some of these trends that you're seeing emerging? I'm not going to talk about all the trends,

[00:11:35] even AI. Of course AI is completely exploding and it's creating the demand for what we are doing. But we live one story below, closer to the infrastructure, and the biggest trend, one we've been waiting a few years to happen, is the

[00:11:50] transition into the open lake house and specifically Iceberg. There was a fight between the different open table formats, Delta, Iceberg, and Hudi, and Iceberg seems to be the clear winner. And every vendor you're going to ask me about is either supporting Iceberg as its

[00:12:07] number one format or as its number two format. That includes Snowflake, and Databricks just bought Tabular for a billion dollars, and you have Amazon and Google. Just choose the company. All of them are going to support Iceberg.

[00:12:21] So right now we are in a point of time where the standard for how you're going to build the lake house is there. I've been operating in this market before there was a standard so we've been

[00:12:31] building lakes that are queryable but you could only query them from specific engines. And now we are doing the exact same thing but our data is available for every warehouse engine possible. That basically really expanded our market and we are super excited about this

[00:12:51] trend and it's also great for the user. This is the right thing for the user: to have more control over those pieces and not be locked into one specific vendor. And you mentioned consolidation in the market there, and that combined with this

[00:13:06] increasing pace of technological change is breathtaking at the moment. So I'm curious, if you dared to look into the future (it's moving so fast at the moment that it's almost impossible to predict anything), how do you envision the

[00:13:18] future of data management evolving with the adoption of lake house architectures. What does that future look like do you think? So the first part that I think is already set and clear is that the open lake house is going to be the standard.

[00:13:34] That same survey I mentioned before is predicting 70 percent of enterprises running at least 50 percent of their analytics on the lake house in just three years. So that's a massive change that's going to happen, away from cloud data warehouses,

[00:13:50] from on-prem data warehouses, and also from old-style data lakes like Hadoop and Spark. The part that's currently being set but hasn't been set completely is the catalog piece. So we talked about controlling access to data. What's going to be your catalog?

[00:14:07] And the Iceberg community has created something called a REST catalog, basically allowing every vendor to create its own Iceberg catalog, and Snowflake recently launched Polaris, their open source Iceberg catalog. And I think that a lot of the companies in this market are going to standardize around

[00:14:27] one open catalog, because they understand they cannot each keep their own catalog; that defeats the purpose of creating interoperability. So what is the catalog of the future going to look like? Are the governance capabilities going to be open source or not?

[00:14:44] So those parts are not completely set yet, but the fact that the open lake house is going to be Iceberg, I feel that's pretty set. Wow, I'd love to get you back on later in the year or early next year and see how things

[00:14:58] are continuously evolving, but I cannot thank you enough for taking the time to come on here, simplify things, put it in a language that everyone can understand. And we'll see if there's something we can do for you now, because some of the biggest

[00:15:09] names in business, VC funding, and tech have either been guests or maybe even listened to this podcast. So is there a person you'd love to have a private breakfast or lunch with? Who would it be and why? He or she might get to hear this.

[00:15:22] Let's see what we can manifest together. Who would it be? I think it would be Sam Altman. What he has been doing with OpenAI has been amazing. And there is so much more that can be done.

[00:15:35] One of my investors told me that every time OpenAI is doing an event, a thousand startups die because they're so innovative in what they're doing. So after the initial launch, the fact that they kept innovating in every launch is very impressive, like the recent SuperCity type of product.

[00:15:58] I would be very curious how Sam is seeing the future and what's going to happen five, 10 years from now with search engines, for example, or with other use cases. So I would go talk to him first. Yeah, what a conversation that would be.

[00:16:13] I will throw that into the ether. Let's see what we can manifest together there. And I love that line your investor used: every time OpenAI hosts an event, a thousand startups die. There's an element of truth in that.

[00:16:26] But for anyone listening just wanting to find out more about Upsolver and explore some of the things we've talked about today, maybe connect with you or your team, or who has additional questions to ask, where would you like to point them?

[00:16:38] I think the website is a great place, they can always go to our Slack channel as well to ask questions. But the best thing, you can just try the product. I mentioned that we had an ingestion product and a lake management product.

[00:16:51] If you're not already in Iceberg, go check out the ingestion product. You can build an Iceberg lake from any of your sources in about 10 minutes of work. And if you already have an Iceberg lake, go check out the management product.

[00:17:05] And basically before you're running any type of compute in your account, there is an analyzer. It's going to show you how much money you're going to save on storage and how much time you're going to save on query latency just by pointing to your existing Iceberg table.

[00:17:20] And our analyzer is going to spit out an answer for you in a few seconds, so you can understand what the gains are before even running a project. All it takes is a 10-minute product experience, and that will help you understand everything much better than any website.

[00:17:35] Well, it's been a huge pleasure to have you on the podcast today. I'll add links to everything so people can find you nice and easy. And we covered so much there from defining exactly what is the lake house, why it's the next big thing in data,

[00:17:48] why business leaders need it and the challenges that might stand in the way of making the lake house the new warehouse, but also the great work you're doing in helping businesses along the way and navigating on that journey.

[00:18:00] But more than anything, just thank you for shining a light on this and sharing your insights. Thanks for joining me today. Thank you, I enjoyed the session. I think it's clear that lake houses are poised to transform the data management landscape.

[00:18:13] But what are the next steps for businesses looking to adopt this architecture? How can they overcome the challenges in transitioning from traditional data warehouses? I cannot thank Ori enough for sharing his insights and expertise today. And for everybody tuning in, I'd love to hear your thoughts.

[00:18:30] What is it that you think is the most compelling aspect of lake houses? And how do you see them shaping the future of data management? And if you've been on a similar journey, I'd love to hear some of the lessons you learned along the way too.

[00:18:43] So email me, techblogwriter@outlook.com, or find me on X, Instagram, and LinkedIn at Neil C Hughes. Let me know your thoughts. Any questions you have for me, please fire them across too. But it's time for me to get out of here now.

[00:18:56] I've got another guest lined up for you bright and early tomorrow to prepare for. So thanks for listening as always. And until next time. Don't be a stranger.