3178: Is AI Infrastructure Broken? How DIaaS is Changing the Game
Tech Talks Daily · February 12, 2025
39:16 · 31.45 MB

AI is everywhere, yet scaling it remains a serious challenge for many enterprises. Organizations are eager to move beyond pilot projects, but they quickly run into obstacles—cloud costs spiral out of control, GPUs sit idle due to inefficiencies, and infrastructure bottlenecks slow down innovation. If AI is supposed to be the future, why is the underlying infrastructure still holding it back?

In this episode, I'm joined by John Blumenthal, Chief Product Officer at Volumez, and Diane Gonzalez, Senior Director of Business Development and Product, to unpack the hidden inefficiencies of AI infrastructure and explore whether Data Infrastructure as a Service (DIaaS) is the answer. They explain why traditional AI infrastructure models are failing and why manual optimization simply can't keep up with modern AI workloads.

John argues that the problem is beyond human management, stating, "No human being is going to be able to answer that question—to actually arrive at that optimization. So you have to turn it over to a machine." We explore why cloud spending is so difficult to control, how businesses are struggling to achieve a return on their AI investments, and what it will take to make AI infrastructure more cost-effective and scalable.

Diane shares real-world examples of how AI teams are losing valuable time and resources trying to work around infrastructure constraints. She explains how DIaaS dynamically composes infrastructure on demand, eliminating bottlenecks and ensuring businesses can run AI workloads at full speed without overspending.

With AI adoption accelerating, organizations are rethinking their approach to cloud infrastructure. But is DIaaS the missing piece that will finally make AI truly scalable? And are businesses ready to let automation take the lead in optimizing their cloud environments? Join the conversation and share your thoughts.

[00:00:03] How can enterprises overcome the roadblocks to scaling AI and machine learning beyond the pilot phase? It's a challenge that many businesses are grappling with as they chase the promise of AI, but find themselves struggling with infrastructure inefficiencies, data bottlenecks and soaring costs. Well today I'm going to be joined by John and Diane from a company called Volumez

[00:00:29] and they are a company at the forefront of AI driven data infrastructure. John brings decades of experience in storage, observability and cloud technologies while Diane has a deep background in infrastructure and product development. But most interestingly of all, together they're tackling inefficiencies that hinder AI and machine learning deployment at scale. So the big question is how can organisations unlock performance

[00:00:57] while also keeping their costs under control? And what role does automation play in optimising GPU utilisation for AI workloads? Well on the recent IT press tour of Silicon Valley, I was fortunate enough to have the opportunity to sit down with them both to learn more. So enough scene setting for me. Let's get them both onto the podcast now. A massive warm welcome to both of you to the show today.

[00:01:25] We've got not one, but two great guests joining me. So to begin with Diane, can you tell everyone listening a little about who you are and what you do? Yeah, my name is Diane Gonzalez and I recently joined Volumez. I come from a deep background in infrastructure and product management, and at Volumez I'm the Senior Director of Business Development and Product. I work for John, my ally here, my Sergeant at Arms, so to speak.

[00:01:51] And we've obviously, we've all met in Silicon Valley as part of the IT press tour. You mentioned John there. So John, would you mind telling everyone listening a little about who you are too? Yeah, I've been a longtime stalwart in Silicon Valley for more than 20 years. I actually grew up in the storage industry at Veritas under Mark Leslie, if you guys remember Veritas back in those days. And then I left in the early days of VMware.

[00:02:19] And so I was at the core of the kernel team there and product management on storage and the storage stack in ESX that some of you may know and love still. And then I left to start a company which was an early observability company that took me into the world of SaaS based analytics and machine learning called Cloud Physics, which was acquired by HPE several years back.

[00:02:43] And this is where Diane and I were working together on SaaS services for the GreenLake platform that were specific to data protection, for example. And then I left to join Volumez a couple of years ago to lead product as the CPO. And I'm based here in San Francisco. Well, thank you to both of you for joining me today. And one of the things that we're seeing in our news feeds every day is yes, AI adoption is growing rapidly.

[00:03:11] Yet many enterprises are still struggling to scale their AI projects beyond pilot phases and also struggling to see the ROI in some of those projects, too. So just to set the scene for our conversation today, how are you at Volumez addressing this common challenge, a challenge that obviously hinders AI and ML deployment at scale? Because it's something that everyone's chasing at the moment, but getting there seems to be somewhat challenging. Well, let's just first start with the need. You're absolutely right.

[00:03:37] Enterprises are very hungry for bringing AI and especially generative AI into their environments. I think many enterprises have benefited from AI already for years. Like whenever you go to a product and you hear something about predictive analytics or understanding problems before they happen, that's generally AI and ML in the back end.

[00:03:58] That is processing large amounts of data and leveraging pattern recognition and learning on those patterns to identify outliers, etc., to then produce some sort of outcome or a trigger or an alerting mechanism. And you see that across all sorts of industries. I think the big thing that's really fed the hunger for AI is the evolution into gen AI, where AI and ML are now producing original content.

[00:04:27] And if you take a look at the use cases for that original content, the explosion of what can happen for a business in terms of outcomes and innovation is really quite dramatic.

[00:04:41] But in order to really take advantage of that, you have to have the proper infrastructure, the proper knowledge in-house, the operational frameworks in place, not to mention seamless data availability and understanding how to work with that data and the infrastructure to actually produce the inference or the outcomes. And this is where you see a lot of problems really start to creep up.

[00:05:08] When you're talking about foundational models, even if you're fine-tuning a foundational model or something like that, you are really leveraging some expensive compute in the form of GPUs. And you really have to prime what is legacy infrastructure to feed those GPUs. And it's a very inefficient, incongruent balance, an imbalance, really, of infrastructure pieces across most enterprises. And so the cost starts to creep up.

[00:05:36] It either creeps up because you're not feeding your GPUs appropriately, therefore you're wasting money in GPUs, or you're actually over-provisioning your infrastructure in an attempt to overcome I/O or storage bottlenecks. And that cramps your operation times. It affects how you size your workloads. You spend a lot of time in root cause analysis. And so both on the hard and soft cost side, you end up with this tremendous imbalance. And that's really where you're seeing a lot of companies fail.

[00:06:06] They kind of give up. It's like, okay, I see the value here, but I can't get my project past pilot, either in time, or maybe sometimes not at all. Because of the tremendous kind of jump that has to happen either in terms of infrastructure or knowledge base, etc. Now, interestingly enough, I think that trend, and John can speak to this too, we've talked quite a bit about it, is on the verge of possibly changing.

[00:06:35] I think we had the DeepSeek announcement that happened. And really kind of the main takeaway from that is this tremendous efficiency that was allegedly achieved, where training their language model cost roughly $6 million, versus something like ChatGPT at around $100 million. Right? So you're talking about an exponential efficiency gain.

[00:07:02] And that has kind of revamped the interest among enterprises in AI. And I think the reason why is because they're acknowledging that AI itself is attainable, that there are efficiencies coming into play that they might be able to take advantage of. That means that these improvements are actually going to increase the utilization of AI. Right? So I think we've reached kind of a curve where there was a backoff.

[00:07:31] You could call this the hype curve. Everybody's seen the Gartner hype curve. AI, AI, AI, AI. Oops, in practice, really, really, really difficult. And that starts to come down. And now with this DeepSeek, I think it's really kind of fed people this new vibe of, wait a minute. It's not so secret. There is a way to attain this. There's some operational efficiencies that perhaps we can all take advantage of. And now it's kind of creeping back up.

[00:07:57] This is being amplified by Satya Nadella himself, who's quoting the 160-year-old Jevons paradox, which talks about how when you create efficiency on some resource, you actually aren't decreasing the demand for that resource. Demand actually goes up.

[00:08:18] And so I think, Diane, the point you're making, which I find really intriguing, is that efficiency is at the core of the ability to get to the ROI. That ROI hinges on two precious resources that aren't really being operated in an efficient way. One is the GPUs, and the other is the data scientists themselves.

[00:08:43] And so our entire focus is really around this operative word of efficiency. And what Diane mentioned, how to attain that efficiency, is really around creating balance in the use of the resources that create a level of optimization, both in the data scientists themselves and also with the GPUs themselves,

[00:09:06] in a way that delivers that kind of efficiency that lets that ROI be realized on this huge business value. Yeah, absolutely. I think to the second part of your question, there's the why is this problematic? And then what does Volumez bring to the table to really help this? That's really where efficiency is a particularly loaded term.

[00:09:29] But when we take a look at an AI ML pipeline, and you take a look at the requirements, both from the soft cost to the hard cost, we can talk about this just in straight infrastructure terms since a lot of people very readily understand that. Things like networking, compute, storage. This is the wheelhouse where Volumez plays.

[00:09:52] So instead of looking at each one of those components as an isolated thing, which can happen very much in hybrid environments, when you take a look at AI, 70% of organizations roughly, there's a statistic floating around, leverage the cloud for a lot of their AI ML. There's reasons for that, but let's park that for a second. But if you're looking at it from a cloud perspective, where it's even more segregated than it would be on-prem, these disparate silos of infrastructure, we don't really look at it that way.

[00:10:21] At Volumez, we look at it from end to end. And we're driving the efficiency by really understanding all of the underpinnings of each one of those silos. So that when we create our own, let's call it an infrastructure pipeline on demand, right? For that particular, let's say, pipeline, we create a very efficient, balanced infrastructure that maximizes that GPU utilization.

[00:10:47] We're working actually with a data scientist who leverages our infrastructure to do testing. And they gave us really interesting food for thought. Before they started working with us, they were maybe maxing out their GPUs at, what, 50%? And after they started working with us, they're pegged at 100%. So if you think about it, we've halved the cost of what their normal runtime would be.

[00:11:14] And you can think of that in soft costs like operations and workflows. Or you can think of it in hard costs from just the sheer amount of time you have your GPUs running. And across all of that, we've developed and brought kind of a maximum yield return on investment that is really, I think, new to the market.
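
Diane's before-and-after numbers imply a simple piece of arithmetic: for a fixed amount of useful compute, doubling effective GPU utilization halves the wall-clock GPU-hours you get billed for. A minimal sketch of that calculation, with purely illustrative figures (neither the hourly rate nor the job size comes from the episode):

```python
# Back-of-the-envelope: how GPU utilization drives cloud cost for a fixed job.
# All figures are illustrative assumptions, not vendor pricing.

def job_cost(useful_gpu_hours: float, utilization: float, rate_per_gpu_hour: float) -> float:
    """Cost of a job that needs `useful_gpu_hours` of actual compute.

    At fractional `utilization`, billed wall-clock GPU-hours are the useful
    work divided by utilization, because idle GPU time is still billed.
    """
    billed_hours = useful_gpu_hours / utilization
    return billed_hours * rate_per_gpu_hour

WORK = 1_000   # useful GPU-hours the training job actually needs (assumed)
RATE = 30.0    # assumed $/GPU-hour for a cloud GPU instance

before = job_cost(WORK, utilization=0.5, rate_per_gpu_hour=RATE)  # I/O-starved GPUs
after = job_cost(WORK, utilization=1.0, rate_per_gpu_hour=RATE)   # fully fed GPUs

print(f"50% utilization:  ${before:,.0f}")  # $60,000
print(f"100% utilization: ${after:,.0f}")   # $30,000 -- runtime and cost halved
```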

[00:11:34] And I think something we can all agree on is one of the biggest issues in AI and ML today is underutilized GPU capacity due to I/O bottlenecks and inefficient data pipelines. And you touched on it there a moment ago, Diane, about how Volumez optimizes GPU utilization and ultimately ensures that maximum performance that businesses are looking for.

[00:11:56] But before you came on the podcast today, I was also reading you've recently achieved industry-leading results in the MLCommons MLPerf Storage 1.0 benchmark. Sounds incredibly impressive. If we could just break it down to exactly what that means for AI and ML workloads and why this is so significant. Yeah.

[00:12:17] So the results that we achieved in MLPerf, the 1.0 benchmark that was published last September, were really, I think, evidence of our architecture and why a fundamentally different approach to processing data pipelines, particularly in the public cloud,

[00:12:37] can produce unbelievable scale, where scale is defined as maximizing your throughput in terms of the amount of data that you can push into a GPU network to then maintain utilization levels in a way that can scale linearly over time as you add more and more data to training or more and more jobs into the same mix.

[00:13:01] And so what we produced is a result of how we can orchestrate and create balance of the resources necessary to feed the GPU network, which is a storage problem, but it's really a hybrid storage and network problem. And that's why looking at the individual resources that are part of the data pipeline and optimizing those because you have deep knowledge and deep profiling on what those components contain,

[00:13:30] the result of that optimization is this level of efficiency that gives you that kind of scale. And so having GPUs that can be run at maximum utilization continuously has many, many benefits, not just for the consumer.

[00:13:46] So an enterprise using the public cloud, for example, will be able to, using our data infrastructure as a service, instantiate that data pipeline for running a job. And that infrastructure will be right-sized directly to what the job contains. The job is run at maximum utilization of the GPUs, and then the infrastructure is torn down and returned to inventory.
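
That compose, run, tear-down lifecycle maps naturally onto a context-managed pattern. A hypothetical sketch of the idea only: `DIaaSClient`, its methods, and the spec fields below are invented for illustration and are not Volumez's actual API.

```python
# Hypothetical sketch of the just-in-time lifecycle described here: compose
# infrastructure sized to one job, run it flat out, then tear it down.
# `DIaaSClient` and its methods are invented for illustration, not a real SDK.

from contextlib import contextmanager

class DIaaSClient:
    def compose(self, spec: dict) -> str:
        """Stand-in for provisioning a right-sized data pipeline; returns its ID."""
        print(f"composing pipeline for spec: {spec}")
        return "pipeline-001"

    def teardown(self, pipeline_id: str) -> None:
        """Stand-in for releasing the composed resources back to inventory."""
        print(f"tearing down {pipeline_id}")

@contextmanager
def composed_pipeline(client: DIaaSClient, spec: dict):
    pipeline_id = client.compose(spec)   # infrastructure exists only for this job
    try:
        yield pipeline_id
    finally:
        client.teardown(pipeline_id)     # returned to inventory even on failure

client = DIaaSClient()
job_spec = {"workload": "training", "throughput_GBps": 40, "gpus": 64}

with composed_pipeline(client, job_spec) as pid:
    print(f"running training job on {pid} at maximum GPU utilization")
```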

[00:14:15] So that ability to create, in some cases, massive scale just in time is a very, very compelling point for increasing the efficiency of your spending on some of the world's most precious resources like GPUs and the time of your data scientist. So from the end user's perspective, it goes back to the heart of what you opened this up with, Neil, which is around the impediments to the ROI and why projects fail.

[00:14:43] This is one of the root causes for why they fail, which is the stumbling around, especially the training phase for particular kinds of models that can be run at scale.

[00:14:52] The second part to this is really around the time of the data scientist and the ability for them to run multiple types of experiments to get to higher quality models and to achieve greater productivity as a data scientist. That is a level of automation that is another interpretation of scale and utilization.

[00:15:14] And we believe solving these twin problems of GPU utilization at scale and the data scientist also operating with new levels of efficiency and scale is a real value to the consumer. Now, to cloud providers, many of whom we're working with in partnerships, this really does translate into a traditional manufacturing concept that we call GPU yield or yield on your data pipeline.

[00:15:43] And Diane always takes me to the broader perspective of the data pipeline yield. And I'm going to yield, Diane, to you to describe what it means to have data pipeline yield, especially from the context of a cloud provider who has many jobs, many customers trying to access their GPU infrastructure.

[00:16:06] And what does that mean to a provider to actually have this level of utilization across those data pipelines? So, Diane, you have a really good exposition on this that we've talked about internally. Yeah. So when I think of the maximum yield concept that we're all accustomed to thinking about when we think of chips and compute resources, I actually think of it at a much greater scale, especially when it comes to AI.

[00:16:31] And the reason why is because the tolerance for time lapse and error is much lower. It's significantly lower. So if you think about it, if we have an error in a table in a SQL database in a traditional application, it causes a problem. There's a delay. There's time to troubleshoot it. Oops, didn't really hurt much of anything.

[00:16:54] And in fact, the cost of that infrastructure and the actual cost to figure out where that error was, etc., is pretty self-contained. It's in a very modular, small, compact space as it would be in a lot of legacy form factors. Now, when you take a look at AI, you're talking about pipelines where just the pre-training phase is hours of data loading, parsing, understanding, sizing.

[00:17:21] And if there's ever a fault or an error to that, you really have to start all over or you have to roll back to a checkpoint. And even at the checkpoint, you have to reload everything. So the tolerance for an error or the tolerance for fault is much lower. So when I think of yield, I go way beyond – I take a look at the entire pipeline. And I think of – John hit some really key terms there. I think about the ability to drive cost for performance, right? Like what is that ratio?
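
Diane's point about rollbacks and reloads has a well-known quantitative side in the HPC literature: the cost of writing a checkpoint trades off against the expected work lost to a failure. Young's approximation is the standard rule of thumb for picking the interval; a sketch with illustrative numbers (none of them from the episode):

```python
# Young's approximation for checkpoint intervals, a standard HPC rule of thumb
# that quantifies the low fault tolerance of long training runs.
# The numbers below are illustrative assumptions, not figures from the episode.

import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation: interval ~= sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

checkpoint_cost = 300    # assumed seconds to dump GPU/cluster state to storage
mtbf = 24 * 3600         # assumed mean time between failures: one per day

interval = optimal_checkpoint_interval(checkpoint_cost, mtbf)
print(f"checkpoint every ~{interval / 60:.0f} minutes")               # ~120 minutes

# Faster checkpoint storage changes the economics directly:
fast = optimal_checkpoint_interval(60, mtbf)                          # 60 s dumps
print(f"with 5x faster checkpoints: every ~{fast / 60:.0f} minutes")  # ~54 minutes
```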

[00:17:49] If I'm deploying an infrastructure to support, for example, inferencing and some small amount of training, even if I'm just retraining or fine-tuning a foundational model, that infrastructure is pretty isolated and dedicated to that. And if I'm doing it in a traditional way, I'm probably oversizing and it's going to be very expensive. In the cloud space, this idea of bringing it up just in time and then tearing it down is hugely attractive,

[00:18:16] especially if I can overcome the inherent restrictions across each of the silos of the services, the infrastructure services that are in the cloud to deliver that maximum yield. It becomes very powerful at that point. And so when I think of maximum yield or yield of a pipeline, I think of it across not just the hard costs that we typically think of, and I see it much further than just making sure our GPUs are functioning at 100%.

[00:18:43] I really think about it from an operation time perspective. I think about it from a resource drain perspective. There are some really interesting data studies out there for AI ML, especially when you look at efficiency of the data scientist versus engagement with ML Ops versus engagement with core infrastructure.

[00:19:05] And in an average pipeline, there's plenty of studies out there, but the numbers hover anywhere from 20% to maybe 40% efficiency of actual innovation and compute time. The rest of that time is spent troubleshooting, sizing, restarting. The data scientist has to spend a lot of their time trying to figure out what's going on. Now you're error prone, your risk increases, which then affects your entire pipeline,

[00:19:32] because you might then have to go back during an inferencing phase or a fine tuning phase and reload, restart all over and revamp. So for me, where Volumez really brings a lot of value to the table is in looking at the whole picture, the composable infrastructure pieces, really as a declarative framework that can be dialed up by the right persona just in time,

[00:20:02] and then kept alive for just the right amount of time to then be able to reproduce over and over as necessary with the pipeline. And we can do that again from the persona that is actually doing the work, which John hasn't really touched on. But we do have integrations in place that really streamline the operations for somebody like a data scientist, so that they have a much more harmonious relationship with their ML Ops teams.

[00:20:28] And there's no resource drain there from a soft cost perspective, which again affects the entire pipeline and the return on investment. Because if your GPUs are really expensive, if we consider those your most expensive asset, trust me, your data scientists are probably number two, if not number one. Because they're the ones who really, not only do they govern how the GPUs are used, but they're also pretty expensive resources themselves. So many great points there.

[00:20:53] As we've talked about, there's no escaping the fact that data infrastructure costs for AI are on the rise. Many enterprises are struggling with that infamous balance of performance and cost. So, John, anything you'd like to add on that, on how you help organizations find that right balance to optimize their cloud spending without compromising performance? Very tricky balance. Not as easy as it sounds, right? No. No. In fact, let's say it's beyond human cognition.

[00:21:22] No human being is going to be able to answer that question to actually arrive at that optimization. So you have to turn it over to a machine. And this is the reason for the existence of Volumez, which is to answer those questions in a way that is automated and also continuously monitoring for those optimizations to create them for the user just in time and highly tailored to the particular nature of the workloads that it's meant to support.

[00:21:51] So achieving this balance or this optimization to get the right amount of performance with cost really is at the heart of the unique value proposition and intellectual property that's been developed and delivered at Volumez. And so the way we do that is by profiling what the public cloud contains in terms of different IaaS components

[00:22:17] and assembling those just in time with intelligence constructed from those profiles. And by having that deep machine learning knowledge of what's available within the public cloud to assemble based on a particular workload's requirements, you can get very, very precise on what the infrastructure needs to contain in order to deliver to that workload with the constraint of the cost that you impose upon our system.
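
What John is describing is a constrained selection problem: given a profiled catalog of cloud components, find the cheapest mix that satisfies a workload's requirement within a cost cap. A toy brute-force sketch of that shape of problem follows; the catalog entries, figures, and field names are all invented, and a production system would work from far richer profiles than this.

```python
# Toy version of the selection problem: from a profiled catalog of cloud
# components, pick the cheapest combination that meets a throughput requirement
# under a cost cap. All catalog figures are invented for illustration.

from itertools import product

# (name, GB/s of throughput per unit, $/hour per unit) -- illustrative profiles
CATALOG = [
    ("nvme-instance", 8.0, 3.0),
    ("network-attached-ssd", 2.5, 0.9),
    ("hdd-tier", 0.5, 0.15),
]

def cheapest_mix(required_gbps: float, max_cost_per_hour: float, max_units: int = 8):
    best = None
    # brute-force search over unit counts for each component type
    for counts in product(range(max_units + 1), repeat=len(CATALOG)):
        throughput = sum(n * item[1] for n, item in zip(counts, CATALOG))
        cost = sum(n * item[2] for n, item in zip(counts, CATALOG))
        if throughput >= required_gbps and cost <= max_cost_per_hour:
            if best is None or cost < best[0]:
                best = (cost, counts)
    return best

result = cheapest_mix(required_gbps=20.0, max_cost_per_hour=12.0)
if result:
    cost, counts = result
    for (name, _, _), n in zip(CATALOG, counts):
        if n:
            print(f"{n} x {name}")
    print(f"total: ${cost:.2f}/hour")
```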

[00:22:46] So this is really extracting all the goodness of the public cloud that is inaccessible to users today because they can't answer that question by themselves. Literally, even the armies that sit inside of GCP and AWS and Azure and OCI, you can never assemble the team of experts to really answer those questions and get it right. And there's just too much change happening in the infrastructure.

[00:23:14] The workloads change. The requirements change. This is really the preserve of a machine. And that's what Volumez is about. And you can see that, by the way, going back to the MLPerf results that you were talking about. If you look at that graph, the one that we presented that you were able to see, it is very compelling.

[00:23:41] If you look at that graph, the difference between what we posted and the next best thing is, what, roughly a third to two thirds higher. So it's like you see already when we're talking about composing or putting together this data infrastructure in the most harmonious way to achieve the maximum yield for the entire pipeline, you're looking at it not just from a soft cost perspective again,

[00:24:10] data scientists, optimizing the operational flow, et cetera. You're looking at it from a hard cost perspective. But if you look at the MLPerf results as your yield, your outcome, your result, it's exponentially higher. So you see gains in return on investment across all of it. You see it in the efficiency of your teams. You see it in the efficiency of your infrastructure.

[00:24:36] And then what you output is also extremely performant and efficient, right? So think of each one of those as stepping stones to the next phase and the next evolution of innovation. And that's really where Volumez is playing. So one thing, Diane, you just said that I think I want to toss in here is when you talk about those next phases, there's also the concept of future-proofing your data pipelines.

[00:25:02] And so when you look at everyone else in the MLPerf space that delivered those results, many are on-premise vendors. Some of them took their controllers and put them in the public cloud. But when you go to build such a system, especially with an on-premise traditional storage controller design, and there are multiple types of architectures that range from legacy to modern, the point here is that you are building a system that has to be qualified

[00:25:31] on a specific type of hardware, a specific set of components. And that's a fixed thing. It's a fixed design that you design to. And you hold that steady so that you don't introduce new factors that create instabilities or a lack of predictability in what the system is going to deliver. However, we are fundamentally different in the sense that we are taking what the public cloud has actually deployed

[00:25:58] and continues to deploy at the component level. The most modern hardware, the fastest, the cheapest or least expensive, and the most capable components actually show up in the public cloud two to five years before they'll show up in a storage system, for example. And I've gotten gray hair over the fact that the pace at which you can actually release storage controllers to market was, let's say, geological.

[00:26:27] So what I'm getting at here is that every time the public cloud introduces a new component, whether it's faster SSDs, PCIe version 5, faster interconnects, faster networking, less expensive networking, with the cost of, say, RDMA coming down, we immediately profile and incorporate that. So when you build a data pipeline using volumes,

[00:26:52] you are actually using the fastest, the least expensive, and the most capable components that are being assembled based on our machine that creates that for you. That set of components won't really show up in these other services or these other types of storage controllers for several years to come. So when you're talking about the speed of change and the rate of change, both on the economics as well as the capabilities of an underlying data pipeline,

[00:27:22] the fastest way to get there, which will also produce the best economics, is using an approach like ours, which composes and orchestrates these entire data pipelines, incorporating the very best that the cloud can offer. And the cloud just gets there faster because of their supply chains and the way they qualify these things than what traditionally happens inside of the storage industry.

[00:27:48] And many enterprises are moving towards hybrid and multi-cloud strategies for AI workloads, which often further complicates things. And I would imagine you get a lot of people coming to you asking for help and challenges around how they achieve that seamless integration across different cloud providers while also maintaining performance consistency. Anything you can share here on how you're helping get over these challenges and turn them into opportunities?

[00:28:15] Diane, coming in from HPE, which has a very strong hybrid story, was central to that story and the creation of products around it. Diane, do you want to talk about this? Like, what is the value prop around being able to deliver across all clouds? Yeah. So, well, first of all, it's the reality of today. Let's just start with that. When you take a look at spend, infrastructure spend worldwide,

[00:28:42] I think, still, if you look at it from a data center perspective at the amount of investment that, for example, cloud providers are making in the data center, your biggest spend is going to be in the data center, on-prem, so to speak. Now, what's interesting about this data point is that it does encompass a lot of cloud providers, and those cloud providers are pouring a lot of money, in the billions, into data center spend for AI infrastructure,

[00:29:13] because they recognize that companies or enterprises need to adopt a hybrid strategy in order to truly execute their outcomes, especially when they're looking at AI. There are some things that they can pull from, and this is where you see a lot of, let's call it, AI-in-a-box solutions coming out, some really good ones from different vendors, and they're really tailor-made for when you don't need these massive scales of GPU clusters,

[00:29:42] or memory, or storage, where you can do perhaps inferencing, RAG use cases, or perhaps some level of fine-tuning with an existing model. But anytime you get beyond that, especially when you're processing the large amounts of data for something like video processing, or language processing, or images, like in healthcare, some of that can't be done. You need a footprint that you're either going to have to, as an enterprise, invest very heavily on, that then you have to ask yourself,

[00:30:12] is it a worthwhile investment? Or you leverage the cloud, right? And so I think a hybrid strategy, especially in AI, is going to be the case until we get to the point where that performance density is something that can perhaps absorb some of these larger GPU cluster needs. This kind of goes to our first point, which is how can we drive efficiencies? And how can we take that and then bring it into an on-prem footprint? But then that begs the question, if we're that hyper-efficient,

[00:30:40] then that makes it even better for a cloud play, right? Because now that just-in-time, there's significant cost savings to being able to bring things up just in time for just the right amount of time to then reduce your overall spend and really focus on achieving your return on investment and outcomes. So I think hybrid strategies, especially in AI, are always going to be there. I think from a Volumez perspective, we understand that

[00:31:06] and we continue to expand what we are doing in the cloud to make sure that we're continuing to drive that efficiency for any of the AI workloads specifically processed in the cloud because that's where we're looking to drive the efficiencies. I think, John, you did a segment when we were in California about what we're targeting with Volumez when we talk about AI ML. And I think that's a really important thing to kind of cover here

[00:31:32] because we're not really talking about something small, but we're also not talking about something that is as gargantuan as perhaps training a DeepSeek, for example. Yeah. Yeah. I mean, the pipeline contains different types of workloads that range from the ingestion phase when you're actually loading data out of one location or a slower object storage.

[00:32:00] And you need to bring that on to a faster staging area, and it's usually a subset of the master copy, the master data sitting in, say, an object store. And in there, you want to be able to do preprocessing, which has to do with both cleaning data and unique transformations of the data. There's a lot of I.O. that happens in this preprocessing tier that is a very different workload than modeling,

[00:32:28] and it's a CPU-based layer that the data pipeline contains. And so this workload can be very I.O. intensive. It can have really strange characteristics on how stuff is read sequentially and then written out sequentially. So it's a big challenge and usually very specific to the preprocessing logic that drives it. And then from there, it's submitted into a training phase or a fine-tuning phase that really has to do with

[00:32:58] how fast can I get the data and throughput into the GPU network. And this can vary depending on the model and depending on the data set sizes. In many cases, models, especially transformer models or large LLMs, they don't really have a storage problem. They have a GPU problem because they are taking billions or trillions of embeddings and actual weights to actually do this magic that happens in a neural network.

[00:33:26] And they're not being fed really large files like those that result from video or imaging or large audio files, for example. Those definitely have a storage networking problem. And that was actually the basis of MLPerf. And then from training, you end up having checkpointing as another infrastructure challenge, which in some cases has characteristics of the HPC workloads that we know and love. In other ways, it looks very different than HPC

[00:33:55] in the way the entire memory of the GPU instance and the cluster gets dumped. But that's a very important workload that's very different than the other ones I've just described. And then as you get to the inferencing, when the model's actually produced and deployed, there you have latency criticality that is, again, a very different type of workload. So the data pipeline has five or six unique workloads. There is no one infrastructure that's going to serve it all.

[00:34:24] And this is why you have to have just-in-time, highly tailored, machine-generated infrastructure for these workloads, step by step by step. They need to be temporal because they keep changing all the time. And you cannot build an infrastructure to serve all of these without creating tremendous amounts of waste. It's just not possible from an infrastructure perspective to do this as a holistic, generalized platform that will serve all of the needs of these different workloads.
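
John's taxonomy, five or six distinct workloads with different I/O patterns, lifetimes, and latency needs, lends itself to a declarative per-stage description of the kind a composing system could consume. A hypothetical sketch follows; the stage names track the conversation, but every field and value is invented for illustration.

```python
# Hypothetical declarative spec for the pipeline stages walked through above,
# each with a different I/O and latency profile, illustrating why no single
# infrastructure serves them all. Field names and values are invented.

PIPELINE_STAGES = {
    "ingestion": {        # pull a working subset out of slower object storage
        "pattern": "sequential-read",
        "throughput_GBps": 10,
        "lifetime": "minutes",
    },
    "preprocessing": {    # CPU-heavy cleaning and transforms, odd I/O mixes
        "pattern": "mixed-sequential",
        "throughput_GBps": 25,
        "lifetime": "hours",
    },
    "training": {         # feed the GPU network as fast as it can consume
        "pattern": "high-throughput-read",
        "throughput_GBps": 40,
        "lifetime": "hours-to-days",
    },
    "checkpointing": {    # bursty dumps of full GPU/cluster memory state
        "pattern": "burst-write",
        "throughput_GBps": 60,
        "lifetime": "seconds-per-burst",
    },
    "inference": {        # deployed model, where latency beats bandwidth
        "pattern": "latency-critical",
        "p99_latency_ms": 5,
        "lifetime": "continuous",
    },
}

for stage, profile in PIPELINE_STAGES.items():
    print(f"{stage}: {profile['pattern']} ({profile['lifetime']})")
```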

[00:34:54] So as a result, there's a very nice marriage that comes from understanding each public cloud's unique characteristics and capabilities. And this is why our profiling capability is specific to each public cloud. We call this cloud awareness, as a label for what we generate and store as a catalog that we reference as we understand what we need to create

[00:35:22] for each of these workloads. And then in its entirety, it's chained together in a way that is automated and doesn't require human intervention to actually create the infrastructure, make it perform properly, and then bring that infrastructure down when the job is actually completed. So having this available in each of the public clouds, all of whom have different capabilities, different network capacities, different costs and structures,

[00:35:52] I think is an absolute necessity, as Diane was saying, to deploy a multi-cloud strategy. And that's what, you know, is again at the heart of Volumez: the ability to apply these unique intellectual property techniques we've developed to do that profiling and then do the just-in-time tailoring and provisioning of the infrastructure based on the different characteristics of the workloads in the data pipeline. And I think this is something we could talk about for hours. And I'd love to get you back

[00:36:22] on the podcast maybe in a few months to talk about this in more detail and also look ahead at some of the challenges in AI infrastructure and how you're preparing for the future. I think I am back in Silicon Valley in June, so maybe we could do something as soon as that. But before that, we've whetted everybody's appetite today. Anybody listening wanting to dig a little bit deeper on all the things that you do, maybe connect with yourselves or your team, where would you like to point everyone listening? Yeah, go to volumez.com.

[00:36:51] You'll actually see early access to the solutions that we're describing here as data infrastructure as a service for training. We're actually expanding that to some of these other workloads, but we've opened up early access to this offering that's based on our DIaaS, our data infrastructure as a service platform. And from there, you'll get guided to contacting us directly and getting access to the fundamental aspects of this platform. And we'd love to have a conversation about it.

[00:37:20] If you go to volumez.com, there's actually a get started section where, if you want to, you can, first of all, understand and read a little bit more about the MLCommons benchmark. But there's also a way for you to kind of sign up and test some of this stuff out. Yep. Well, I'll have links to absolutely everything so people can find you nice and easy. And I meant what I said there. It would be great to get you back on in a few months and expand on this conversation we started today. I would encourage

[00:37:49] everybody listening to contact myself or indeed you if they've got any big questions that they would like asking or answering. But more than anything, just thank you for sitting down with me today. Really appreciate your time. Likewise, Neil. Thank you very much. So with AI infrastructure costs skyrocketing and enterprises facing mounting challenges in balancing performance and cost, finding the right approach to data pipelines is more critical than ever. It seems that Volumez

[00:38:19] is bringing a fresh perspective to tackling some of these inefficiencies, ultimately ensuring that AI workloads run at peak performance without unnecessary waste. But as AI adoption continues to evolve, what will be the next hurdle for enterprises? Will automation and infrastructure as a service models become the standard for optimizing AI pipelines? Or are we merely scratching the surface? As always, I'd love to hear your thoughts.

[00:38:49] Email me at techblogwriter@outlook.com, or find me on LinkedIn, Instagram, and X, just @NeilCHughes. Let me know. But that is it for today. I'll be back again tomorrow with another guest. Hopefully you'll join me again tomorrow, but thank you as always for listening today. And until next time, don't be a stranger.