What happens when the real bottleneck in artificial intelligence is no longer training models, but actually running them at scale?
In this episode of Tech Talks Daily, I sit down with Satyam Srivastava from d-Matrix to explore a shift that is quietly reshaping the entire AI infrastructure landscape. While much of the early AI race focused on training ever larger models, the next phase of AI adoption is increasingly defined by inference. That is the moment when trained models are deployed and used to generate real-world results millions of times a day.

Satyam brings a unique perspective shaped by years of experience in signal processing, machine learning, and hardware architecture, including time spent at NVIDIA and Intel working on graphics, media technologies, and AI systems. Now at d-Matrix, he is helping design next-generation computing architectures focused on one of the biggest challenges facing the AI industry today: efficiently running large language models without overwhelming data centers with unsustainable power and infrastructure demands.
During our conversation, we explored why the industry underestimated the infrastructure implications of inference at scale. While training large models grabs headlines, the real operational pressure often comes later when those models must serve millions of queries in real time. That shift places enormous strain on memory bandwidth, energy consumption, and data movement inside modern data centers.
Satyam explains how d-Matrix identified this challenge years before generative AI exploded into the mainstream. Instead of focusing on training hardware like many AI startups at the time, the company concentrated on inference efficiency. That decision is becoming increasingly relevant as organizations begin to realize that simply adding more GPUs to data centers is not a sustainable long-term strategy.
We also discuss the growing power constraints surrounding AI infrastructure, and why efficiency-driven design may be the only realistic path forward. With electricity supply, cooling capacity, and semiconductor availability all becoming limiting factors, the industry is being forced to rethink how AI systems are architected. Custom silicon, purpose-built accelerators, and heterogeneous computing environments are now emerging as key pieces of the puzzle.
The conversation also touches on the geopolitical and economic importance of AI semiconductor leadership, and why the relationship between frontier AI labs, infrastructure providers, and chip designers is becoming increasingly strategic. As governments and companies compete to maintain technological leadership, the question of who controls the hardware powering AI may prove just as important as the models themselves.
Looking ahead, Satyam shares his perspective on how the role of engineers will evolve as AI infrastructure becomes more specialized and energy-aware. Foundational engineering skills remain essential, but the next generation of engineers will also need to think in terms of entire systems, combining software, hardware, and AI tools to build more efficient computing environments.
As AI continues to move from research labs into everyday products and services, are organizations prepared for the infrastructure shift that comes with an inference-driven future? And could efficiency, rather than raw computing power, become the defining metric of the next phase of the AI race?
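To make the efficiency argument concrete, here is a rough back-of-the-envelope sketch in Python, with purely illustrative, assumed numbers that do not come from d-Matrix or the episode. It compares two ways a data center might try to serve ten times more inference: adding ten times the GPUs, or keeping the accelerators it already has far busier within the same power envelope.

```python
# Back-of-the-envelope sketch with purely illustrative, assumed numbers
# (none of these figures come from d-Matrix or the episode).

RACK_POWER_KW = 40.0       # assumed power budget for one rack
CHIP_POWER_KW = 1.0        # assumed draw per accelerator
QUERIES_PER_CHIP = 100.0   # assumed queries/s per accelerator at full utilization
GPU_UTILIZATION = 0.10     # the ~10% utilization figure discussed in the episode

chips_per_rack = int(RACK_POWER_KW / CHIP_POWER_KW)
baseline_qps = chips_per_rack * QUERIES_PER_CHIP * GPU_UTILIZATION

# Option 1: brute force. Ten times the capacity means ten racks,
# ten times the power, and ten times the cooling.
brute_force_qps = 10 * baseline_qps
brute_force_kw = 10 * RACK_POWER_KW

# Option 2: efficiency first. Same rack and power budget, but hardware and an
# inference stack that keep the accelerators busy (assumed 80% utilization here).
EFFICIENT_UTILIZATION = 0.80
efficient_qps = chips_per_rack * QUERIES_PER_CHIP * EFFICIENT_UTILIZATION

print(f"Baseline rack:    {baseline_qps:>6.0f} queries/s at {RACK_POWER_KW:.0f} kW")
print(f"Brute-force 10x:  {brute_force_qps:>6.0f} queries/s at {brute_force_kw:.0f} kW")
print(f"Efficiency-first: {efficient_qps:>6.0f} queries/s at {RACK_POWER_KW:.0f} kW")
```

With these assumed numbers, better utilization delivers roughly the same order of capacity as the brute-force build-out while staying inside the original power budget, which is the trade-off the episode keeps returning to.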
Useful Links
Connect With Satyam Srivastava
Learn more about d-Matrix
[00:00:04] Welcome back to the Tech Talks Daily Podcast. Today we're going to be talking about the part of AI that is quietly becoming the main event. I'm talking about inference. And as more organisations move from training big models to actually serving them in real products, the bottlenecks are shifting fast.
[00:00:22] And it stops being about the model size and starts becoming about the memory, the energy, the data movement, and whether your existing infrastructure has what it takes to keep up without burning money and power. My guest today is going to be joining me from a company called D-Matrix. He's an engineer who has spent years thinking about energy-efficient machine learning and what scalable inference should look like inside the data centre.
[00:00:49] So if you want a clear, grounded view of where AI infrastructure is heading, and also want to learn more about what leaders should be doing right now to prepare for the inevitable future, you're going to enjoy this conversation. But enough from me. Let me introduce you to my guest right now. So thank you for joining me on the podcast today. Can you tell everyone listening a little about who you are and what you do?
[00:01:16] Thank you for having me here, Neil. My name is Satyam. I am an electrical engineer by training. I have a background in signal processing, information theory and machine learning. I spent a brief time at NVIDIA before grad school and then spent about 10 years at Intel working across Intel graphics, media technology and AI.
[00:01:37] I spent a lot of time writing low-level kernel code, and that got me introduced to the hardware architecture and how to write performant code for hardware. And then I eventually morphed into more of a co-design role, where you start influencing the hardware by knowing how to write good code for it. So that kind of created my interest in co-design for the most complex of problems in computing.
[00:02:04] Also at Intel, I was looking at the growth of AI and what was happening in the industry at that time. And it was somewhat evident, even back six, seven years ago, that there was a gap in efficiency in the computing infrastructure as AI technology was improving. So that led me to co-found, along with some of my collaborators, a workshop on energy-efficient machine learning, which we've been hosting every year for the last eight years.
[00:02:33] And the purpose of the workshop was to create a forum where people can exchange ideas related to machine learning, sustainability, architectures, and algorithms. And it was through this network of like-minded individuals focused on sustainability in AI that I got to know this one-and-a-half-year-old startup called D-Matrix.
[00:02:58] So I've been here for about five years since then, responsible for creating and architecting solutions for sustainable, scalable AI computing for the data center, primarily around inference. And that's one of the many reasons I was excited to have you join me today, because on this podcast every day, we try and demystify complex technologies and put them in a language everyone can understand.
[00:03:23] And right now, a quick scroll down our news feed will tell us that AI is moving from a training-driven era into an inference-first reality. So from your vantage point and everything that you're seeing here, what changes when inference becomes the dominant workload? And why are so many organizations maybe underestimating the infrastructure impact of such a shift? Yeah, that's very well stated.
[00:03:50] And in addition to underestimating it, I would say it's more a realization that they have historically underestimated the magnitude of the inference opportunity. So if you think about it, training is what grabbed the headlines. It's what proved the AI technology. It brought the technology into the mindset of individuals that this technology can solve vision and language,
[00:04:19] historically the realms of human intelligence. But really, for that to reach customers and get into the hands of everyday individuals, it's the millions and millions of interactions that happen through inference that define the technology's success.
[00:04:37] This is the reality that most companies are now realizing: how big this is, and the infrastructure that is needed to scale the innovation and make this technology something that gets integrated seamlessly into the lives of human society. And, of course, it's worth highlighting here, we're having this conversation right now in 2026, and organizations are waking up to this reality.
[00:05:03] But back in 2019, what, seven years ago, D-Matrix saw this bottleneck coming. So what signals at the time hinted that the real constraint would be things like memory, energy, and data movement rather than just model performance? Tell me about that. Yeah, if I look back at the 2019 era, the AI technology had already made its mark. It had solved ImageNet, the vision problem.
[00:05:32] It had excelled on the natural language processing benchmarks that were common in that day. And language-based models had started to find a niche as part of web searches and some other translation-related opportunities. But the ChatGPT moment hadn't arrived. So at that point, everybody was still building general-purpose hardware.
[00:05:55] A few startups back in the day were still trying to address training and take on the behemoth in the room. Whereas D-Matrix saw that the real bottleneck wasn't making the models bigger and smarter: for all the technology that was being created using these large models, someone had to serve them efficiently and at scale.
[00:06:20] So that's why D-Matrix from day one was founded to focus on inference as the key to unlocking the potential of this technology and making it accessible to a lot more people than just the frontier labs. So the mission for D-Matrix back in 2019 was to build the best transformer inference system in the world.
[00:06:43] And of course, fast forward to present day, we're now seeing memory chip shortages and power constraints and how they both can slow down AI projects around the world. So in practical terms, how should infrastructure leaders, do you think, be thinking differently about capacity planning when one data center could soon be expected to do the work of 10? Because that's how quickly things are moving, right? Yeah.
[00:07:08] So the expectation is that data center computing capabilities have to become more and more dense. Now, the brute-force way is by cramming more computing into the same footprint. The consequence is that, if you were to do this with just brute-force scaling, you have to put in 10 times more chips, 10 times more power, 10 times more cooling infrastructure. And if you think about where things are today, it's just not sustainable.
[00:07:36] We just saw a couple of days ago a statement from Elon saying that turbines are sold out till 2030. So even if somebody was so motivated to just create a power plant next to a data center or dense data center, they simply cannot. It'll take a decade before they can do a scaling like that. Instead, what's more logical is to look at efficiency as a driver for this density.
[00:08:02] Imagine that instead of brute-forcing and cramming in more chips, we had custom designs that can actually do the work equivalent of 10 times the GPUs. That gives us the best of both worlds. You get the computing density and the work that is needed, but we also don't run into walls of physics that we simply cannot solve.
[00:08:25] So it's no longer about adding more GPUs, but about creating technology that enables data centers to be able to do 10 times more work. So we have to think about energy, memory and efficiency as part of building this modern infrastructure. And of course, if we look around, we see hyperscalers from Amazon and Google to Qualcomm, all in this race to build inference at scale.
[00:08:51] And again, from your perspective, where do you start seeing traditional approaches break down? And what does a more purposeful inference stack unlock that general-purpose systems cannot? What are you seeing here? Actually, this one is simpler. If I think of it from a technology point of view, traditional approaches start to break down. If you see the utilization of your top-end hardware in the low single-digit percentages,
[00:09:19] imagine that you've spent tens of thousands of dollars to acquire these chips. Then you're creating a sophisticated power delivery and cooling infrastructure only to see that your chips are 10% utilized. Now, anybody with a sound and rational approach would understand that we have a fundamental problem here. There's just no ROI for a scaling like this.
[00:09:44] I don't know if hyperscalers will talk about it very freely, but if they were to tell you what the utilization is, there would be lots of red faces. So there is a realization that we can't brute-force everything with GPUs anymore. You need something that works with knowledge of the hardware and of the workloads, that understands the computing needs and the computing characteristics of those workloads,
[00:10:11] and that is open to building an inference stack that lets you run these workloads efficiently, using the metrics of speed, cost, sustainability, and efficiency. And I'd love to flirt with a conversation around global conflict at the moment and some of the things that we're seeing around the... What is your take on the collaboration between governments, hyperscalers, and indeed competitors
[00:10:35] in order for the US to maintain a competitive edge globally and ultimately try and maintain its AI dominance? And on the flip side of that, what does meaningful collaboration actually look like in practice? And where do you think the biggest friction still exists? Collaboration, I think, is a must-have. Yeah. It's obvious that a technology as foundational as AI is not something that a single company is going to address alone.
[00:11:05] They cannot meet the needs. They don't have full vertical control over every single aspect of the infrastructure to scale AI and to win alone. Collaboration would entail not only working across the industry, but also across academia and governments, sharing resources and creating interoperability standards so that everyone, from startups
[00:11:32] to enterprises, governments, and sovereigns, can innovate and thrive together. Now, friction is a bit more interesting. I think, if I introspect a bit, there are three areas of friction as far as these collaborations are concerned. One is, of course, the hype around the technology.
[00:11:55] One has to be judicious in bringing out the practicality of AI, the places where it actually excels versus where there is a lot of hot air about what this technology can do. The downside is that if you over-promise and under-deliver, it's going to hurt the technology as a whole. So that's the first one. Yeah. The second one, I think, is the closed nature of ecosystems.
[00:12:24] There is still a lot of inertia around traditional ecosystems, and the tooling that is being built today is still built on top of traditional computing architectures. That is making it harder for smaller companies like ourselves to make a sizable dent by enabling a generational improvement in efficiency. So that ecosystem also has to have a sense of openness about bringing these new architectures in.
[00:12:54] And then the third one is that much of this technology is still in the hands of very few people who control the different parts of the ecosystem. Whether we look at the frontier labs or the computing infrastructure, all of that is just very contained. And it locks out smaller players and even academia. Historically, academia has played the role of, say, the more mature person
[00:13:20] or a collaborator with the industry, taking a longer view rather than chasing quick monetization. AI is one of the spaces where academia has not played that role in the past 10 years or so. And that's concerning. So the real benefit of this technology will come when it's accessible to as many people as possible. Yeah, I completely agree with you. And there's also a growing conversation around chip design and manufacturing,
[00:13:48] almost as a national security issue. So why do you believe leadership in AI semiconductors has become such a strong defensive asset now? And what risks emerge if that leadership ever were to slip? So let's break it down this way. Yeah. The value of AI technology comes from roughly three parts.
[00:14:14] There are the frontier labs, which are creating the state-of-the-art models. Then there is the infrastructure on which these models are trained and deployed. And then there is the semiconductor piece, the chips on which this whole infrastructure is built. Right now, if we look at the best technology in the world, AI technology, all three of these pieces are coming from the US.
[00:14:39] It's the US-based frontier labs running on US-based infrastructure, running on chips designed in the US. So the United States has everything right now to lead and continue to lead, drive innovation, and be a template or a role model for how this technology is deployed at scale. Now, these three components are not limited to where they are today.
[00:15:09] There are foreign frontier labs running on different infrastructure and even creating their own chips. In order for us to continue to have the best technology run on US-based components, we should not be locking everyone out. You cannot maintain control by limiting access to chips. It's not going to be effective. Instead, you have to strike a balance and provide a pathway for domestic consumption
[00:15:38] so that the chips are available to the domestic frontier labs, while also enabling external players to have access to them, so that the next-generation technology, whether it comes from the US or from another source, still works best on US-based parts. So leadership in the AI space for the US depends upon being a fast innovator. People should want your components because they are the best in the world.
[00:16:06] And it depends on being able to deploy them both at home and export them, so that everyone has access to them. We don't want to ship the most advanced tech abroad first instead of adopting it domestically, because then we run the risk that the technology advances faster outside than at home. And as AI infrastructure inevitably becomes more specialized and more energy-aware as well,
[00:16:36] how do you see the role of an engineer changing, both at the chip level and inside the data center? Are there any new skills that will define a globally competitive workforce over the next few years? Do you see any big changes here? The role of the engineer is certainly evolving. It's no longer just creating computing infrastructure or chips or architectures. Engineers have to build a better computing infrastructure as a whole.
[00:17:04] They have to take this view of a problem being a systems problem rather than just a chip problem or an algorithm problem. The engineers also have to be users of the same technology. They have to be interacting with this technology every day to be able to get familiar with it, to understand its needs and be able to predict where this technology is going.
[00:17:30] Now, in terms of skills, I take a slightly different view compared to some of the luminaries in the field who say that traditional learning or engineering fields are going to be obsolete. I don't think so. Foundational skills are still going to be needed. And it's important for us, even as we embrace AI technology as engineers, to stay aware and to be solid in our own domain. That's when AI can give you superpowers.
[00:17:57] If we rely upon AI alone to solve problems without being the one in charge, one, we don't know how well or how poorly the technology is doing for us. And two, if there are weird corner cases, nobody will have the insight to actually decipher what went wrong. So foundational skills are still needed. Engineers are still going to be in demand. It's just that the engineers who embrace AI,
[00:18:26] this next crop of engineers, will be incredibly productive by combining their foundational skills with the superpowers that AI can give them. And if we do look further ahead, if productivity gains from AI inference don't really compound until around next year, 2027, what decisions do you think enterprises and public sector leaders should be making right now
[00:18:51] to avoid being locked into infrastructure choices that no longer fit the reality one, two, three years ahead? Should they be making any decisions now? Yeah, that's a great question and observation. We have to be able to anticipate the new mix of computing needs in an inference-centric world. We have to anticipate the growth. We have to take an efficiency-first mindset
[00:19:18] and direct investments into efficient, scalable platforms that evolve with the development of AI technology over the next couple of years. And the risk is that if you build today's infrastructure for yesterday's and today's workloads, by the time this infrastructure actually comes online, technology will have moved on and you will be stuck tomorrow
[00:19:44] with something that does not quite live up to the expectations of the workloads of that day. So the infrastructure choice really has to match the reality of an inference-driven AI technology and the demands that it creates in terms of the mix of compute and efficiency. Well, I've loved chatting with you. And from a personal point of view,
[00:20:11] what excites you about your work at D-Matrix right now? What makes you want to jump out of bed in the morning? Anything you can share about what you're working on and what we can expect from D-Matrix in the future? Absolutely. Yeah, one thing that I really like about D-Matrix is that, being a really tiny company of about a couple hundred people, we have made some really big contributions. And if I were to just point out a few,
[00:20:38] we made a call on inference when everybody else was attacking training. We made a call on transformers when everybody was still processing CNNs. We identified the difference between the models of that time, back in 2020, and the needs of an LLM, and opted for on-chip high-bandwidth SRAM as a technology. We popularized the notion of latency-bound throughput
[00:21:08] as a metric for LLM effectiveness. And then on the technology side, we are thought leaders in device-initiated collectives, 3D stacking of memory, and so many other areas. So that's really what excites me about being at D-Matrix: we can really punch above our weight while being such a tiny, nimble company. In terms of what we are working on, we are working on the next generation. We are working to bring our first product into mass production.
[00:21:36] We are working on creating a programming model, not just for D-Matrix but for the industry in general, that addresses architectures that are not derivatives of traditional GPUs and CPUs but follow a dataflow model. So stay tuned for that. We are currently designing our next-generation chip, Raptor, which is based on 3D DRAM technology stacked on top of logic, giving us immense capacity from DRAM density
[00:22:04] while maintaining the bandwidth advantage that we have with on-chip memories. And we are also looking at bringing out our scaling technology using our network interface card, as well as working with partners on creating a reference rack-level solution that can be adopted on customers' premises. There's one thing that I didn't get a chance to talk about, which is the nature of coexistence of some of these accelerators.
[00:22:33] So as part of our push into bringing out custom ASIC-based accelerators, we're not necessarily trying to replace GPUs. The latest thing the industry is exploring right now, even if you look at the Grok-NVIDIA collaboration, is heterogeneity. And that's a driver that we are also very much involved in. So we see these ASICs, such as our solution,
[00:23:01] as coprocessors for the GPU. So we're not necessarily trying to replace GPUs everywhere, per se. Where it makes sense, we can offload certain parts of workloads that are not very suitable for GPUs and instead run them on these custom ASICs, which are better suited for them. And as part of that, the overall system efficiency improves. Remember that utilization I talked about, the 10%? Yeah. If you take away the low-utilization phases of computation
[00:23:29] and put them on a dedicated accelerator, suddenly the utilization of the GPUs goes up, because those low-utilization parts have been taken out. So that's one of the places where we would like to continue to drive conversations along with our partners. Wow. Exciting times ahead. Sounds like we need to get you back on later in the year to learn more about some of those future announcements. And before I let you go, for anyone listening wanting to dig a little bit deeper,
[00:23:59] or maybe just keep up to speed with some of the announcements as they drop, where would you like to point everyone listening? People can visit our website, d-matrix.ai. I am available on LinkedIn, and the company is on LinkedIn as well, so please reach out. If you're interested in energy-efficient machine learning, check out the EMC2 workshop website as well. Sometimes we post blogs, sometimes we give talks, and we're always looking for collaboration.
[00:24:29] So if this is a topic of interest to you, please don't hesitate to reach out. Awesome. Well, I love how at D-Matrix you're redefining performance and efficiency for AI inference at scale and making generative AI commercially viable, which I think is more important than ever. So anybody interested in collaborating or finding out more information, please check out the show notes. I'll be adding links to everything that you mentioned there. But more than anything,
[00:24:57] just thank you for demystifying this space, talking about it in a language everyone can understand, and sharing your story and the great work you're doing. Thank you for joining me today. Thank you for having me. I really enjoyed this one because it cuts through the noise and gets to the real constraint: efficiency. My guest explained why low hardware utilization is a warning sign and how purpose-built accelerators can work alongside GPUs
[00:25:26] to lift overall system performance and, most importantly, economics. So if you're making infrastructure decisions today, hopefully this is a conversation worth sharing with your own team, because I think the choices that you lock in now are going to shape what's possible next year and beyond. So as always, I'll add links to D-Matrix and Satyam in the show notes. I'd love to hear from you as well after you check out some of the additional details there.
[00:25:54] What do you think will be your biggest limiter for AI inside your organization? Is it going to be compute, power, memory, or something else entirely? As always, let me know. Pop over to techtalksnetwork.com, send me a message and we'll carry this conversation on. But thank you for being a part of it today. I will join you again tomorrow for another guest. Speak with you then. Bye for now. Bye.

