2827: Data, Decisions, and Dagster: Nick Schrock's Blueprint for Engineering Excellence
Tech Talks Daily, March 10, 2024
Episode 2827, 25:13, 15.08 MB

Nick Schrock, the innovative mind behind Dagster Labs and the renowned co-creator of GraphQL, joins me on Tech Talks Daily. Nick takes us through his illustrious journey from his foundational days at Facebook, where he spearheaded the Product Infrastructure team, to his visionary leap into solving some of the most pressing issues facing data and ML engineering today through Dagster, his open-source data orchestration platform.

Nick shares insights from his experience at Facebook, elaborating on how internal tools like React and GraphQL revolutionized the company's development practices and set new benchmarks for the developer community worldwide. His transition from Facebook to founding Dagster Labs was driven by a deep-seated desire to address the complexities and inefficiencies in data infrastructure, a challenge he identified as a critical pain point for engineers across industries.

Throughout the conversation, Nick delves into the core areas of data orchestration, highlighting the importance of enabling practitioners to have end-to-end ownership of data pipelines without the need for a centralized team. This approach, he argues, is pivotal in the era where data and ML engineering are becoming fundamental to decision-making processes both in human and business contexts.

Much of the discussion is dedicated to exploring the future of open source in the SaaS-dominated landscape and the operational convergence of ML, AI, and data engineering. Nick emphasizes the delicate balance required in managing an open-core business model and shares personal anecdotes about the "Engineering Founder's Dilemma" — the intricate dance between leading the vision and running the company.

Listeners will gain a unique perspective on the evolution of data platforms and engineering, underscored by Nick's advocacy for a robust, community-driven approach to open-source development. He sheds light on the challenges and rewards of building a platform like Dagster, which aims to simplify and democratize data infrastructure for companies of all sizes.

Nick also advises technical founders on maintaining equilibrium between their visionary roles and their companies' operational demands. This episode is a deep dive into the mechanics of data orchestration and a masterclass in leadership, innovation, and the transformative power of open-source projects in addressing complex engineering challenges.

[00:00:00] How often do we witness a technological innovation that not only meets the current needs but also

[00:00:07] foresees future challenges? Well, today I'm going to be joined by a visionary in the realm of data

[00:00:14] orchestration. His name is Nick Schrock, founder and CTO of a company called Dagster Labs, and

[00:00:20] Nick's journey from principal engineer and director at Facebook, where he was pivotal in

[00:00:25] helping create GraphQL, to revolutionising data orchestration with Dagster presents a unique

[00:00:32] perspective on technological evolution. And with Dagster, I also want to learn more about how they've

[00:00:38] arrived at this critical juncture where data and ML engineering are becoming incredibly important in driving

[00:00:45] decision making in businesses and beyond.

[00:00:49] But what led Nick to this path?

[00:00:51] How is Dagster helping shape the future of data workflows?

[00:00:55] Now, before I get today's guest on, a quick shout-out to the sponsors of Tech Talks Daily,

[00:01:00] because in today's remote-first world, I think settling for outdated managed file transfer solutions means

[00:01:06] ultimately you're risking your sensitive data.

[00:01:09] But upgrade to Kiteworks, the gold standard

[00:01:11] in secure MFT. Boasting FedRAMP Moderate authorization,

[00:01:16] Kiteworks isn't just secure. It's a complete transformation of how your

[00:01:20] business handles file transfers and communications.

[00:01:24] And with state-of-the-art file sharing, email security, and a platform that's as

[00:01:29] robust as it is user-friendly, Kiteworks is empowering you to manage and protect your

[00:01:34] data like never before.

[00:01:36] So say goodbye to compromise and hello to unmatched security and efficiency and you

[00:01:41] can do that by making the switch to kiteworks.com. Visit kiteworks.com to begin. That's kiteworks.com to secure your data and empower your business.

[00:01:52] But now let's get today's guest on. Buckle up and hold on tight as I beam your ears all the way to

[00:01:58] New York, where Nick is waiting to talk with us today. So a massive warm welcome to the show.

[00:02:05] Can you tell everyone listening a little about who you are and what you do?

[00:02:09] Yeah, my name is Nick Schrock.

[00:02:11] I'm the CTO and founder of Dagster Labs, the company behind Dagster,

[00:02:16] the data orchestration framework.

[00:02:17] Fantastic.

[00:02:18] So much I want to talk with you about today, because I always like to

[00:02:22] find out a little bit more about my guest's origin story and their journey that they've been on to what they're doing now. Can you describe your journey

[00:02:30] from Facebook to founding Dagster Labs, and also what inspired you to focus on solving data

[00:02:36] infrastructure challenges? There's got to be a story there, right?

[00:02:40] Yeah, so I guess I'll start off with Facebook. I joined Facebook in 2009 and within a couple of years there, I had effectively been very

[00:02:52] involved, created this team called product infrastructure, which was to make our application

[00:02:56] developers more efficient and productive.

[00:02:59] So what that meant is that we built internal tools and frameworks that served our product

[00:03:03] engineers, the people who are actually building the website or building the mobile app, etc. And we did a lot of really

[00:03:10] important work internally and we ended up externalizing a bunch of that work through open source projects.

[00:03:16] So that is the team that created React, for example, which is now the most broadly used JavaScript

[00:03:22] framework in UI programming.

[00:03:25] And then I didn't have anything to do with React, but I was kind of across the proverbial hall from

[00:03:29] them. And then I was much more involved in our, in Facebook's data fetching stack, and that ended

[00:03:36] up becoming GraphQL, which I was the initial tech lead on and co-wrote the spec, which got open sourced in 2015. So I've been living and breathing dev tools and developer productivity and

[00:03:50] developer infrastructure for a long time.

[00:03:53] I left in 2017 and I was figuring out what to do next.

[00:03:59] And I was talking to companies both inside and outside the Valley, kind of in a

[00:04:03] like a beginner's mindset way, asking them what their biggest technical liabilities were.

[00:04:09] What in your mind is the biggest engineering obstacle for you achieving business outcomes?

[00:04:16] What kept coming up over and over again, with a remarkable consistency, was this notion

[00:04:21] of data and ML infrastructure. And as I kind of investigated and looked into it,

[00:04:27] like I was adjacent to these issues when I was at Facebook,

[00:04:30] but I didn't work directly on them.

[00:04:33] And, you know, what I kind of,

[00:04:36] it fit a lot of the properties of problems

[00:04:39] that I care about a lot.

[00:04:41] So, you know, kind of a few things need to be true.

[00:04:44] One, there are, like,

[00:04:45] engineers in pain. Like when I see engineers who are working day to day, like becoming

[00:04:53] frustrated with their tools and their process or being unnecessarily slowed down or just

[00:04:57] living in chaos that actually makes me, it actually makes me fundamentally very frustrated

[00:05:02] and angry, both personally and on their behalf,

[00:05:05] and I find that really motivating.

[00:05:07] Two, what I like to call a problem that matters.

[00:05:13] So I like working on problems that are kind of fundamental problems in engineering and

[00:05:19] computing that might impact millions and millions of developers.

[00:05:22] So like in the GraphQL days, it's client-server interaction. Like everyone has to deal with that. And you know, the other

[00:05:30] thing is that the, you know, with this data pipeline, which is actually the core thing

[00:05:35] that we're talking about kind of orchestrating the creation of production data assets, every

[00:05:41] company in the world has to do that to one degree or another. And also in terms of a problem that matters, like these

[00:05:50] assets that get managed by these platforms are the basis of the

[00:05:56] vast majority of decision making in modern society, right, or a

[00:06:00] lot of it, meaning that the dashboards that executives make

[00:06:04] business decisions on, you know, how health care is priced, whether or not

[00:06:09] people get their mortgages approved.

[00:06:12] All these core societal functions are determined and managed by data platforms.

[00:06:17] And fundamentally, I thought they were all being built in kind of unsound

[00:06:20] ways with inferior tools.

[00:06:22] Like you kind of move from software engineering to data engineering.

[00:06:25] It kind of feels like you're going backwards like a decade in time in terms of the, the,

[00:06:31] the efficiency of the process as well as its integration into a proper engineering process.

[00:06:38] The other thing, in terms of properties that I like when working on tech, is where the technology I'm working on has

[00:06:49] kind of a strategic point of leverage in the organization. I like to think, if this

[00:06:56] project I'm working on doesn't shift the org chart in some way, I'm not as interested in it.

[00:07:03] I like the cross product of technology and organization.

[00:07:06] And you know, GraphQL is a lot like that

[00:07:10] because it's this natural choke point for all client-server interactions.

[00:07:13] And I thought the orchestration layer

[00:07:16] served that role as well.

[00:07:17] The orchestration layer is very interesting in data platforms

[00:07:20] and data pipelining because it invokes every single

[00:07:24] computational engine,

[00:07:25] which in turn touches every storage engine, and any data practitioner who's putting

[00:07:30] a data pipeline or an asset into production has to interact with the orchestrator in some

[00:07:36] way, because all data comes from somewhere and goes somewhere.

[00:07:41] And so, yeah, it was kind of this unique mix of properties that made me super

[00:07:46] interested in the space, and I find it really motivating. Well, what a great story. And fast forward

[00:07:52] to 2024. And of course, that landscape of data orchestration just continues to evolve

[00:07:58] at a rapid pace. So I'm curious, as someone who has been in this game for so long and is

[00:08:02] right in the eye of the storm even now, what key shifts have you observed and how are

[00:08:07] you trying to address some of these changes too?

[00:08:11] So I think the dominant,

[00:08:13] you know, one of the dominant themes since the company was founded, in the last, you know, five, six years, is that,

[00:08:21] I think there's general acknowledgement that data engineering and data platforms

[00:08:26] in general must be driven more by software engineering process.

[00:08:30] And I think we were kind of, we were ahead of the curve on that, I guess,

[00:08:36] and that gives me some, quite a bit of comfort, I guess.

[00:08:42] I think the other thing that's happening is that

[00:08:45] you know, we see tools like Airflow,

[00:08:49] which is kind of the incumbent in the space,

[00:08:52] you know, they are

[00:08:56] expanding to what

[00:08:59] we've been focused on from the beginning, which is making the orchestrator a much more feature-filled

[00:09:05] control plane that is much more aware of the assets that are produced by a data platform

[00:09:11] and therefore has integrated lineage. They've made acquisitions in this area.

[00:09:14] So the competition is certainly fast and furious.

[00:09:19] But I guess, that's all true, but the other thing that's happened is like a lot of the fundamental problems that we

[00:09:28] talk about still exist today. Like if you go into a team at a

[00:09:35] company that's building their internal data platform, they

[00:09:39] are still dealing with nuts and bolts issues about just like, how do I make at a very basic level

[00:09:48] that pipeline authors that I serve efficient and productive? There is still so much work to do in

[00:09:55] this area. Well, I'm curious, how do you envision the roles of data engineering and data

[00:10:01] orchestration evolving in tandem with this growing integration of

[00:10:05] AI and machine learning in business operations? It feels like there's something happening here too.

[00:10:10] For sure. So in terms of ML, specifically, so they're kind of the same thing, but I'll tackle

[00:10:19] them separately. So, traditional ML. You know, among our cloud customer base,

[00:10:26] you know, you ask them what they use Dagster for and 90% of them say ETL and analytics,

[00:10:33] and that made sense. But 50% of them also say that they use it for ML and 40% for what I'll call,

[00:10:40] what we call, well, we use these words and let them self-identify, production applications.

[00:10:45] So 90% traditional ETL and analytics, 50% ML, 40% production use cases.

[00:10:52] And that adds up to more than 100%.

[00:10:53] So that means multi-use case is the norm and 50% of our users are using this for ML.

[00:10:58] So one kind of theme that we see is, there was an article that hit Hacker News a few months ago that said,

[00:11:06] MLOps is mostly data engineering and we couldn't agree more. So we really view ourselves as a data

[00:11:12] engineering tool and that we want to do the data engineering component of the ML workflow.

[00:11:22] That means building the data pipeline.

[00:11:29] So when you're building an ML model, right? In general, you're building a data pipeline,

[00:11:30] but just the last step or maybe a middle step for inference

[00:11:33] is producing a model instead of a data set.

[00:11:35] But the fundamental kind of like process is very similar.

[00:11:39] I think the other thing here is that in AI

[00:11:43] and kind of going to more generative AI tools, where the companies

[00:11:52] are figuring out how to integrate AI capabilities into their systems, and the way that they can

[00:11:59] differentiate beyond the publicly available LLMs is to actually feed their own proprietary data into their

[00:12:08] own models.

[00:12:09] Just a simple example would be like training a model on the internal knowledge of your

[00:12:17] company so that you can ask questions about it.

[00:12:20] In order to do that, you need to be able to take the data that's in your systems and transform it into a form

[00:12:27] that your training pipelines can ingest. And that fundamentally is building a data platform and doing data engineering,

[00:12:34] which is our bread and butter. So because so much of the value will be in having your proprietary data in a format that can be ingested and interpreted by ML engines,

[00:12:47] we think that AI only increases the value of data platforms in a company.

[00:12:53] Well, considering the current state and future potential of open source in an era of SaaS,

[00:12:59] where do you see the greatest opportunities and challenges here? Anything that stands out to you?

[00:13:04] In terms of open source in SaaS, I still think the opportunity with open source is, you know,

[00:13:14] especially when it comes to dev tools, is it's an adoption accelerant, and you can get people

[00:13:22] to adopt it, especially early adopters, early in a technology life cycle, get that feedback on it.

[00:13:29] And then also open source makes the notion of adopting that tool for critical infrastructure much, much less risky

[00:13:39] from the standpoint of the adopter, because if, say, the company went away or the company started overcharging

[00:13:46] for a hosted SaaS service or something, the company could still move back to a pure open

[00:13:53] source on-prem solution for their utterly essential business critical workflows.

[00:13:58] I think that is the opportunity of why it makes sense.

[00:14:03] Dagster Labs, you're proudly an open-core business model.

[00:14:06] So I'm curious, what are the challenges and rewards of maintaining that delicate balance, especially

[00:14:12] in a rapidly changing tech landscape?

[00:14:15] Anything you can share around it?

[00:14:16] Yeah.

[00:14:17] So I think managing an open source community and the expectations around it

[00:14:25] can be challenging.

[00:14:28] One is that a community has kind of a life of its own.

[00:14:31] You have to make sure that it remains like a positive, good place.

[00:14:35] And that involves leadership and being thoughtful

[00:14:38] about how the community evolves.

[00:14:40] I think also, you really have to think carefully about literally communicating

[00:14:48] how you divide between like what contributions you're going to make in open source and then

[00:14:52] what the business model is and be very transparent about the incentives and how you think about

[00:14:57] things.

[00:14:58] Yeah, because I think that your users understand that there has to be a viable business model

[00:15:04] behind it, but they just want clarity and understanding.

[00:15:07] They want to be able to model and predict your behavior on that front.

[00:15:12] Doing that is often a challenge.

[00:15:19] On behalf of engineering founders that could be listening anywhere in the world, I'm curious.

[00:15:24] I've asked

[00:15:25] this question. How do you navigate that dilemma of leading the company's vision versus the

[00:15:30] day-to-day running of the company? Any strategies you found effective? Have you been able to walk

[00:15:34] away from the technical side or is it still dragging you back in?

[00:15:39] Well, I mean, for me in particular, I was the solo founder and CEO for about four and a half years.

[00:15:48] And then October last year, I actually made my head of engineering the CEO of the company,

[00:15:55] and I'm CTO now. So I have definitely moved on from doing as much of the day-to-day management of all our different kind of vertical functions,

[00:16:07] and that is Pete's job now.

[00:16:10] So I guess one answer is what advice you have,

[00:16:13] or one piece of advice is that,

[00:16:16] or what I did is I made someone else CEO after a while.

[00:16:21] But I could speak as it is in terms of, you know, it's a very large question when

[00:16:28] you're operating as founder CEO about this balance of leading the vision versus day-to-day running

[00:16:36] of the company.

[00:16:38] You know, I think obviously hiring is incredibly important, and trying to push down as many day-to-day ops as

[00:16:48] possible to your middle management staff is super critical. And I think the other

[00:16:54] thing is it's really important to do time management. In the day-to-day running of

[00:17:01] a company there's just like a certain amount of time that is

[00:17:05] going to be taken doing what I'll call chores, one-on-ones, etc., etc., as well as some reactive

[00:17:10] work that just always comes up.

[00:17:15] And I think you need to set realistic expectations and goals around how much of your time is

[00:17:21] going to be chores and reactive work and psychologically

[00:17:26] kind of become at peace with that and then build on that and carve out as much time as

[00:17:33] you feel like you need to do bigger picture vision stuff. Love that. And of course data and ML

[00:17:40] engineering are increasingly driving decision making in businesses and indeed society arguably

[00:17:45] as well. But are there any best practices that you'd recommend for building robust and scalable data

[00:17:50] engineering tools, because we've seen how some get it wrong. But I'm interested in how you were

[00:17:57] or any advice you can offer around some of those best practices.

[00:18:00] Well, obviously you should adopt Dagster and solve all your problems.

[00:18:04] But no, I think that, again, it's a very broad question, but I think it's really important

[00:18:13] to think thoughtfully about the software engineering process that you set up in your data platform.

[00:18:19] And think about it from the standpoint of the day to day experience of someone who's just trying to do real work in the platform that you're kind of putting together.

[00:18:28] Like, can this person make a change to the business logic of the system without a massive

[00:18:37] amount of fear?

[00:18:39] Do they know if they're going to break anything before they push it to production?

[00:18:44] Just think about the day to day experience of these very basic questions.

[00:18:48] And then the other thing is how do you, once you have that sort of process set up, kind

[00:18:53] of the end goal here is to enable end-to-end ownership of data pipelines by those data

[00:19:00] practitioners and keep a centralized team out of the loop.

[00:19:03] Because I think the idealized state of one of these data platforms is where you have the

[00:19:07] people who are actually building the pipelines co-located organizationally with the stakeholders

[00:19:13] who depend upon those, the data out of those pipelines.

[00:19:16] And then you have a centralized platform team that's effectively an enablement team. And so I think that, you know, a clear anti pattern here is where there's one centralized

[00:19:30] data team who's just fielding requests for data sets from all different other aspects

[00:19:35] within the business and they don't have enough context to know what's going on in those businesses,

[00:19:40] those business units don't like feel like they control their own destiny. So think about it from like go through the customer experience of being a developer on the platform

[00:19:52] and think about how that developer could be self-sufficient while co-located next to a business unit that they serve.

[00:19:58] Looking ahead, if I was to ask you to look in my virtual crystal ball,

[00:20:02] what would you say are the big milestones or innovations that

[00:20:05] you might anticipate in the field of data orchestration and how might they impact the

[00:20:10] way that businesses operate in the future? And again, it's a huge question, I think it's

[00:20:14] an answer that everyone's searching for. But what do you anticipate happening here?

[00:20:19] Yeah, so I think when we talk to business leaders who are thinking about their data platform

[00:20:25] strategy, the phrase that resonates with them the most is that they need their single pane

[00:20:32] of glass.

[00:20:34] Right now their data platform is this, like, Rube Goldberg machine where they don't

[00:20:39] know what's happening and then there's tons of chaos and no one knows kind of the ground truth about what's going on in the world.

[00:20:45] And they want a consolidating view, not just in terms of UI, but also a system of record that they can program against and set up business logic against,

[00:20:57] that can really stitch together what before was this crazed, disorganized system into the

[00:21:05] single unified view where you can understand your data assets and pipelines that span the

[00:21:09] entire organization. I think that will make everything work much more fluidly, as

[00:21:15] different parts of the business depend on each other, in that the outputs of data

[00:21:21] from one business unit are ingested into the other one, as well as for, like, massive

[00:21:27] organizational productivity purposes, and being able to,

[00:21:29] like, understand what's going on in the system. So, you know, that's where we're really looking forward is like, you know,

[00:21:37] really setting up our system,

[00:21:40] Dagster, in terms of being this centralized control plane, the single pane of glass, so you can

[00:21:45] kind of run your entire data platform and then leverage that to do things like cost control,

[00:21:52] understand what the structure of your system is, and data governance and having a single source

[00:21:59] of truth for all your production data assets, rather than just kind of like a widget that makes you

[00:22:04] build pipelines

[00:22:05] more efficiently.

[00:22:06] Well, a huge thank you for taking the time out of your busy day to share your insights.

[00:22:11] Before I let you go,

[00:22:12] I'm going to ask you to leave one final gift to everyone listening.

[00:22:14] And that is a book that we're going to add to our Amazon wish list.

[00:22:18] Every guest leaves the book that they recommend.

[00:22:21] What would you like to leave everyone listening with?

[00:22:23] Well, I guess he's a controversial figure now, but love him or hate him.

[00:22:28] I do recommend reading Isaacson's, Walter Isaacson's Elon Musk book.

[00:22:33] If nothing else, for the SaaS founders out there,

[00:22:36] you know, if you think your life is tough and the mission of your company is tough,

[00:22:40] just read about, you know, what it took to get SpaceX off the ground and put your problems in perspective a little bit.

[00:22:46] But it's kind of the book I read recently that comes to mind.

[00:22:55] Oh, so I'll get that added straight to our Amazon wish list.

[00:22:59] Before I let you go, if anyone listening wanted to find out more information about Dagster Labs,

[00:23:03] there are some different URLs etc. out there at the moment. If people want to find out more

[00:23:07] information about that, contact your team, etc., where would you like to point everyone listening?

[00:23:12] Yeah, dagster.io is kind of our main, our main site about the project and the company,

[00:23:18] and then if you also Google, you know, we have a very active Slack community. And then also, you know, the core framework is open source.

[00:23:27] So if you Google GitHub dagster-io, you can hop right into our GitHub repo.

[00:23:31] Oh, so I'll add links to all those things so that people can find them nice and easily.

[00:23:35] And can't thank you enough for coming on today.

[00:23:37] Talking about the state of data orchestration, the future of open source in the era of SaaS,

[00:23:42] and also operational convergence of ML, AI,

[00:23:46] data engineering.

[00:23:47] So many exciting things happening in that world.

[00:23:49] Just thank you for helping me and everyone listening.

[00:23:51] Make sense of it all.

[00:23:52] Thanks for joining me today.

[00:23:53] Awesome.

[00:23:54] Thanks for having me.

[00:23:57] So as we wrap up today's enlightening conversation with Nick, I think it's clear that the landscape

[00:24:02] of data orchestration and the role of open-source

[00:24:05] platforms in modern technology are both complex and ever evolving. And Nick's insights into

[00:24:11] the operational convergence of ML and AI and data engineering, alongside that delicate balance

[00:24:18] in leading a visionary tech company, for me shed light on the multifaceted nature of

[00:24:24] technological innovation.

[00:24:26] But what does the future hold for open source platforms like Dagster, and how will they continue

[00:24:31] to influence the realms of data and ML engineering?

[00:24:34] As I leave you with those thoughts, I want you to share your perspectives and join the

[00:24:39] conversation and let me know how you see these technologies impacting your world.

[00:24:45] And as always you can email me at techblogwriter@outlook.com, or find me on Twitter, LinkedIn, Instagram, just @NeilCHughes.

[00:24:51] Let's keep this conversation going.

[00:24:54] But that's it for today, so big thank you for listening as always and until next time.

[00:24:58] Don't be a stranger.