Kirstin Burke:
So thanks, everybody, for joining us this month. We have a very timely topic around the cloud. There are a lot of conversations right now about AI and how folks are thinking about using AI for innovation, and about HPC. A lot is going on, and I think we're starting to see the next iteration of where and how the cloud is going to be used in organizations. So we thought it was timely today to talk to you about how to prevent cloud costs from crushing your AI initiatives. And actually, I'm in marketing, so I probably should have said preventing cloud costs from raining on your initiatives or something like that.
But Shahin, as we talk to a lot of our customers and prospects right now, AI is top of mind. People are really trying to figure out where to focus, but then how to execute so they don't get crushed by it. How do we balance innovation, performance, and budget? I think you and I have both been talking about cloud since before it was called cloud, and we've seen all the different cycles cloud has gone through. This, I think, is the next cycle. Where is your head? When people start talking about this, how should we think about the cloud as it relates to all of these new things we see going on in the market?
Shahin Pirooz:
Yeah. Just for a second: I get into a lot of conversations with vendors who say they were cloud before cloud was cloud, and I think it's worth talking about what we mean by that. I remember my first forays into cloud were in 1997 at EDS, when we started what was our e-business, and we didn't really understand this cloud thing at the time. Then in 1999, I left EDS to go work for an Oracle ASP because, again, there was no such thing as SaaS at that point, or cloud, or anything. And then we met when I joined CenterBeam in 2001. You had already been there since '99. We effectively built infrastructure as a service, platform as a service, collaboration as a service, and many other cloud functionalities that were not yet called cloud. And we were not yet called an MSP back then, because that term didn't exist either.
That's going on almost three decades of what is today called cloud experience. So this is not another vendor saying, "We know cloud, we've been doing this forever." We actually have been doing this forever. We've felt these pains. We've felt the go all in, come all back, go all in again, come all back. The context we want to share with you today is really personal, first-hand experience across multiple companies, and with many customers at those companies as well.
So as we talk about this specific topic, high-performance compute as it relates to AI workloads, it's top of mind for everybody right now. AI has become the new cloud: you have to have AI in everything you do. It's the new marketing wave, with everybody trying to say, "My app does that too." And your customers are expecting it. In every dialogue I'm in, somebody asks, "What do you do in AI?"
And I think it's important to understand that AI workloads are creating another leap in data consumption, on top of the already excessive growth of data over the past twenty years. Many of us are finally able to use all the data we've been storing and saving, but we have to enrich that data and make it more useful so that we can train these models, educate them on our business, and, if we're talking about generative AI, have these chatbots and generative AI solutions hold a dialogue that is intelligible about your business.
The implication, though, is that a tremendous amount of data that used to sit in archive now needs to come back into mainstream, performance storage, so that we can analyze it, educate our models with it, and pull intelligible information out of it, not just for the AI models but also to make business decisions from. We're finally getting to the point where our data is actually becoming information, as opposed to just piles of zeros and ones. And there's a handful of things that come up.
In my opinion, there are three key categories impacted by this shift toward building AI workloads running on high-performance compute. The first, obviously, coming from a security background, is enhanced data security. We really have to figure out what data is sensitive and what we have to keep in-house versus using public AI engines like OpenAI. Do we have to get a dedicated, locked-down version of it? Do we have to build it in-house? Regulatory compliance comes in: how much of the data we're feeding these models is restricted by our industry's regulations or by the government? And in that same data security vein, we now have to worry about data sovereignty: what country can the data live in? We start dealing with GDPR, and with state and local government controls here in the United States, that say we have to restrict data and keep it bound within certain regions, that data from the European Union can't come into our AI models in the U.S., and so on and so forth.
So number one is that enhanced data security, which creates a lot of challenges for people. And yes, you can do all of that in cloud. That's the first of the three things I mentioned; the next two, which we'll get to, are the cost side and optimization from a performance perspective. So the three things are enhanced data security, optimized performance and customizability, and, third, cost predictability.
Really, when we talk about how to optimize these workloads, it's about what they're for. What are we trying to solve? What problem or issue is it that we're solving? We may have some custom builds, tailored hardware for a specific need in the HPC space, and cloud providers might not have the ability to tailor to that level. So now we have to consider: do we build this on-prem? Or do we give up that ability to customize and just go with what's available in cloud?
And some of that is ensuring that peak performance is there when there's overhead, when there are bursts, and things like that. So do you build your scale-out solution locally as you build out? Low latency is a factor. If we have these AI models built out and the generative AI is interacting with an application our customers use, we need the generative AI application to interact with the data at the lowest latency possible, so it feels like a human conversation, not like asking a question and waiting ten or twenty seconds for a response. I can tell you I've evaluated AI solutions that are pretty brilliant, but my analysts had already made decisions before the AI responded. And sometimes it's right, sometimes it's wrong. That leads to some of the customization and optimization.
In addition to low latency, you also have to deal with optimizing the actual workloads themselves: being able to tune the infrastructure to the specific task you're trying to solve. There are a lot of companies out there that will build multiple large language models and tap into different ones depending on the topic, and each of those large language models might have unique needs and unique requirements for the underlying compute. So that workload-specific optimization is another factor here. Then there's full use of assets. When we build things out in the cloud, do we build scale-out solutions that auto-scale, so that we're not paying for too much in advance and are consuming as we go? But there's a tax associated with that: we pay a higher rate for a smaller asset footprint in hopes that we don't overpay for something we're not going to use. So when we talk about full use of assets, it's really maximizing your investment so that you're taking advantage of what you paid for, whether that's in the cloud or on-prem.
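To put rough numbers on that autoscaling "tax," here's a back-of-the-envelope sketch. All rates and capacities below are hypothetical placeholders, not real cloud prices:

```python
# Back-of-the-envelope: on-demand autoscaling vs. committed capacity.
# All prices are hypothetical placeholders, not quotes from any provider.
ON_DEMAND_RATE = 4.00   # $/GPU-hour, pay-as-you-go premium
COMMITTED_RATE = 2.40   # $/GPU-hour, reserved/committed pricing
HOURS_PER_MONTH = 730

def on_demand_cost(avg_gpus_in_use: float) -> float:
    """Autoscaled: pay only for what actually runs, but at the higher rate."""
    return avg_gpus_in_use * HOURS_PER_MONTH * ON_DEMAND_RATE

def committed_cost(peak_gpus: int) -> float:
    """Committed: pay for peak capacity 24x7, but at the lower rate."""
    return peak_gpus * HOURS_PER_MONTH * COMMITTED_RATE

# Peak of 10 GPUs, but an average of only 4 actually busy:
print(on_demand_cost(4.0))  # 11680.0 -- autoscaling wins at low utilization
print(committed_cost(10))   # 17520.0
# Break-even average: peak * (COMMITTED_RATE / ON_DEMAND_RATE) = 6 GPUs here.
# Above that average utilization, the "full use of assets" model is cheaper.
```

The crossover is the whole decision: the emptier your cluster sits, the more the autoscaling premium is worth paying; the busier it sits, the more that premium becomes pure waste.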
All of it really comes back to that theme of optimization, performance, and customization. The workloads are so different and unique; some will work well and some won't. Latency is a factor. Optimizing to a specific task is a factor. So there are a lot of decisions that go in. It's not as simple as, let's grab a large language model, push it into OpenAI, and hey, we're AI. Yes, that works. But that's not the whole answer.
Kirstin Burke:
Well, listening to you, I felt like it was Groundhog Day. What I mean by that is that you and I have had multiple conversations about cloud, about cloud performance, about cloud optimization over the past ten or fifteen years. And while the thing being innovated keeps changing, there are some fundamentals you can't overlook.
I'm hearing you say you can't just set it and forget it. You can't just throw something in and treat everything the same. It's really the planning you do before you go in: the understanding of the requirements, of the apps, of the outputs is so critical. And those things have not changed, even though AI is the new shiny star we've got. So many things are the same.
Would you say, as we've evolved over the last ten or fifteen years, that something has changed in terms of how we plan and how we optimize, or are we still basing things on the same premises we always have?
Shahin Pirooz:
So, I remember the conversations you're talking about explicitly. I remember in the early to mid-2000s, when cloud was starting to gain steam, and we were talking to folks through, well, they weren't live streams back then, they were recorded webinars that we posted. We were talking about the fact that nothing had changed. You can't abdicate your responsibilities. You still have to do enterprise architecture. You still have to do enterprise security. It doesn't matter if it's in somebody else's data center or cloud; you still own and are accountable for how you build your apps, how you consume the resources, how you use them, how you control costs, all those factors. Nothing has changed in that context.
Really, the only thing that has changed is that we've made some really interesting improvements in the underlying technology, which let us do things a lot faster and take advantage of compute capabilities that didn't exist fifteen years ago. When we talk about HPC, the key component of HPC today is the GPU, and fifteen years ago GPUs weren't being used as general-purpose compute engines the way NVIDIA's are now. The GPU has changed the way we do compute: the CPU runs the underlying operating system, while the GPU does the heavy lifting and is the highly performant processor used for applications like AI. Memory models and processor architectures have changed too, so we can access and manage resources within compute in a very different way.
I was recently at a conference about high-performance compute, and I'm hoping to go to another one soon. It was academic; there were a lot of academics there, and I learned a tremendous amount about how they're researching ways to consume these high-performance compute platforms even better: to make the workloads work better and faster, be more efficient, decrease latency, and optimize the workload to the hardware specifically, because of the hardware's constraints and requirements. That's the kind of thing that will probably reach the general masses in the next three to five years, but it's interesting to be at the front end of it with these academics and see how they do it.
That's the only thing that has changed: technology has advanced, and it will continue to. If we look at Moore's law, it's going to keep improving and moving faster than our pace. But all of the underlying business and architectural decisions and constraints have not changed. You are still accountable for building a good app, a performant app, a cost-effective app, and one that is secure. Those things still fall to whoever is the developer.
The role of the enterprise architect has changed a little bit, because growing up as an enterprise architect in the early days of my career, we would design systems in specific categories. There would be enterprise architects for systems versus network versus databases and so on. Today that has all become combined. You have to be a jack of all trades, and it's really difficult to find someone with experience across all of it who can do a really good job.
And the cloud platforms make that even more complex, because each of the three major clouds calls the same kind of workload something different. So you have to learn and understand: am I an AWS expert, an Azure expert, or a GCP expert, and how do I best use their high-performance compute? Which databases are best? Which data pipelines are best? Are there good data-flow models built in, or do I have to build my own? Do I bring open-source tech to the cloud? All those factors come into this optimization and customization dialogue. And in building the application in the cloud, you also have to do the same thing for the hardware: even though you don't own the hardware, you're still picking the underlying hardware that meets the requirements of your application. So that's really that second piece.
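To illustrate the naming problem, here's a small, non-exhaustive sampling of how the same building block goes by a different name (and a different API) on each of the big three clouds:

```python
# Equivalent managed services across the big three clouds (illustrative,
# not exhaustive). Same building block, three different names and APIs.
EQUIVALENT_SERVICES = {
    "object storage":      {"AWS": "S3",        "Azure": "Blob Storage",     "GCP": "Cloud Storage"},
    "managed Kubernetes":  {"AWS": "EKS",       "Azure": "AKS",              "GCP": "GKE"},
    "data pipelines":      {"AWS": "Glue",      "Azure": "Data Factory",     "GCP": "Dataflow"},
    "managed ML platform": {"AWS": "SageMaker", "Azure": "Machine Learning", "GCP": "Vertex AI"},
}
```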
So that's the short answer to your question, after a very long answer: nothing has changed but the underlying technology. Our responsibilities haven't changed.
Kirstin Burke:
So, we've talked a little bit about the risk and the security, and we've talked about the optimization. Obviously budget is an issue, right? What we have found through every cloud cycle is that everyone's excited, everybody jumps in, everybody starts getting the bills and says, "Oh my gosh, this is not what we thought." Then there's the scramble to figure out: how do we get to a model where we can take advantage of what the cloud brings without overspending or blowing the budget? How does that play into where we are going with high-performance computing? How do we think about spend management? How do we think about that balancing act between where we want to go and doing it in a way that's economically responsible?
Shahin Pirooz:
Yeah, that brings us to the third point I set up front. Enhanced data security was number one; optimizing the performance and customization of the workloads and applications was two. The third point is cost predictability. And there are three factors that really come into play there.
Cost stability is probably the first one. When we look at any kind of compute workload in the cloud: we moved to cloud because we were getting really frustrated with data center sprawl and compute sprawl. And when we moved to cloud, we got a new term, cloud sprawl. It became even harder to manage, because we didn't have direct control over who could sign up for cloud resources. There wasn't complete visibility into the cloud, the monitoring tools weren't great, and the cost management tools weren't there.
And there were a lot of hidden costs we didn't understand in the early days, data egress being one example. Say we build our data models, do the processing, and use high-performance compute in the cloud to build out our AI workload. If we need to extract data and push it into an application that lives elsewhere, moving that data out of any of the big three clouds carries a price per gigabyte. So you get charged for data egress.
If your application takes off wildly, that's great news, because you're probably making more money, but it also means your costs are going up, because there's more data egress. And as the application takes off, again, a good problem to have because your business is growing, you start bursting in the cloud and consuming more resources, and those resources are very expensive. Did you build all the proper processes to shut those resources down as soon as the workload doesn't need them, or are you getting billed for a full month? There are all these little things to think about and consider. So cost stability is about avoiding unpredictability: the spikes and peaks where all of a sudden you're saying, "Oh, my God, we were only $5,000 last month and this month we're $100,000."
That's a huge scare. And if you're a small company, and by small I mean more of a startup, a startup with the intention to grow, and you've just signed a big customer but the revenue hasn't really hit yet, and your bills just went to a hundred thousand, you're in a whole world of hurt unless you have money in the bank. So those are some of the things.
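As a rough illustration of how egress charges alone scale with success, assuming a hypothetical list price of $0.09 per gigabyte:

```python
# How data egress charges scale as an application takes off.
# $0.09/GB is an illustrative list-price tier, not a quote.
EGRESS_RATE_PER_GB = 0.09

def monthly_egress_cost(users: int, gb_out_per_user: float) -> float:
    """Cost of pushing data out of the cloud to users or other apps."""
    return users * gb_out_per_user * EGRESS_RATE_PER_GB

print(monthly_egress_cost(10_000, 2.0))   # $1,800/month -- manageable
print(monthly_egress_cost(500_000, 2.0))  # $90,000/month after you "take off"
```

Nothing in the application changed between those two lines; only its success did, which is exactly why the bill blindsides people.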
Cost predictability is one of the things that makes on-prem a really viable solution, as do all three of the bullets I highlighted. Take enhanced data security: if it's running in your data centers, you control all aspects of it. But that means you now have to make sure you have all the controls in place for protecting physical access to the data center and digital access to the assets and resources. You get more control, but you also get more responsibility. That's something important to pay attention to.
On the cost predictability side, though, if you're buying hardware via capital expenditure, you know you're building to peaks. You know the costs associated with operating the platform, and you can build that into your business model. You could even use leases to support scaling up. You have the ability to run this platform in a way that aligns with your business model, without hidden charges and costs that spike all of a sudden. So there's a huge reason a lot of companies are now thinking, "Oh, my God, I can't run HPC in the cloud," and this is the primary factor. The data sovereignty factor comes into play, the regulatory compliance component comes into play. There are a lot of factors there.
There's also often a lower TCO when you look at building on-prem. There is a high initial investment, which can be smoothed out if you use a lease model, but it can really lower the total cost of ownership of the platform, because in the years that follow, if you're getting five years out of the hardware, your five-year cost of that investment is going to be much lower than operating in the cloud. And even if it's only twenty or thirty percent, that's twenty or thirty percent back to your profit margins, especially for high-demand workloads. If you're actually consuming those resources, if you were hitting the peaks regularly on the local infrastructure you built, it'll be much cheaper. That's a pretty important factor.
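Here's a simple five-year sketch of that comparison. Every figure is a hypothetical placeholder, but it shows how a twenty-to-thirty percent gap can emerge for a steadily busy cluster:

```python
# Five-year TCO sketch: on-prem vs. cloud for a steadily busy GPU cluster.
# Every figure is a hypothetical placeholder for illustration only.
YEARS = 5

# On-prem: large capex up front, modest annual opex (power, cooling, support).
onprem_capex = 1_000_000
onprem_opex_per_year = 150_000
onprem_tco = onprem_capex + onprem_opex_per_year * YEARS  # 1,750,000

# Cloud: no capex, but a steady monthly bill for equivalent capacity.
cloud_monthly = 40_000
cloud_tco = cloud_monthly * 12 * YEARS                    # 2,400,000

print(f"on-prem is ~{1 - onprem_tco / cloud_tco:.0%} cheaper over {YEARS} years")
# -> on-prem is ~27% cheaper over 5 years (under these assumptions)
```

The caveat is in the assumptions: the gap only materializes if the cluster stays busy. A mostly idle on-prem cluster flips the result.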
The other factor is that you don't get locked into a platform; vendor lock-in is what I'm talking about. It's the ability to not be locked into how AWS or GCP or Azure has designed and built their versions of the components you need: your data pipelines, your storage, your databases, your AI model components. You could be consuming all of those as microservices from those players, be locked in, and find it very difficult to move. Versus, if you build it on-prem with your own open-source technologies, or even commercial technologies, you have control over that technology and over the decisions when it's time to move off of it.
What is ultimately beneficial in looking at an on-prem solution versus building it in the cloud is hybrid flexibility. You can still take advantage of cloud by creating a hybrid capability in your platform and your application, such that if you don't want to build to peaks on-prem, you can absolutely burst to cloud for the peaks, as opposed to running in cloud 24×7.
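A minimal sketch of what that looks like as a placement policy, own the baseline and rent the peaks, with hypothetical capacity numbers and function names:

```python
# "Build to baseline, burst to cloud" placement policy (minimal sketch;
# the capacity number and function name are hypothetical).
ONPREM_GPU_CAPACITY = 8  # GPUs you own and have already paid for

def place_job(gpus_requested: int, gpus_busy_onprem: int) -> str:
    """Prefer hardware you own; send only the peaks to the cloud."""
    if gpus_busy_onprem + gpus_requested <= ONPREM_GPU_CAPACITY:
        return "on-prem"
    return "cloud-burst"

print(place_job(2, gpus_busy_onprem=4))  # on-prem: fits owned capacity
print(place_job(6, gpus_busy_onprem=4))  # cloud-burst: peak exceeds baseline
```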
Kirstin Burke:
So one of the promises of cloud is time to market. As we talk about the viability and attractiveness of on-prem for some or a lot of this, what comes to my mind is: how much does that slow you down? What can you do to accelerate that build? Because if your competitor is deploying something in the cloud, how do you make that not work against you competitively?
Shahin Pirooz:
So, the short answer is we've seen it happen time and time again, and I'll point to some big companies. Look at Netflix. Netflix started all-in on cloud, and they built a phenomenal platform. They beat the heck out of the competition, their time to market was better than everybody's, and they changed the way we do video streaming forever. But Netflix came to a point where they said, this is way too expensive to run in the cloud, and they moved a substantial portion of their infrastructure on-prem. They built in that hybrid model to be able to scale up and scale down, and eventually, as they evolved, they really used the cloud for peaks as opposed to a 50-50 model.
So I think there's value in starting in cloud, but engage somebody like us who has the experience to talk with you about using open-source stacks in the cloud as opposed to the cloud-native offerings, so that if you do decide to bring it on-prem, the transition is easy. And when you do decide to bring it on-prem, let us help you architect the solution with the right technologies, from a hardware perspective and a software perspective, to enable an on-prem solution that continues to meet the same demands and needs you built for in cloud.
And the decisions you make up front to get that competitive time-to-market advantage can become technical debt that slows you down when it's time to make this decision, because you have to rewrite your stack, redo the platform, and transition to it if you decide later that you need to move on-prem. It was not a cheap endeavor for Netflix to move on-prem, because they had to re-architect a lot of what they did to work there. So you can make those decisions up front instead of making them later. And it's very simple things: if you're using cloud-native workflows for data pipeline management, there are open-source equivalents, which the clouds are probably using under the surface, that you can use instead. Those open-source equivalents will run just as well on-prem as they do in cloud. You do have to manage the open-source platform yourself, so it adds a little overhead, but you can run it easily in cloud, and then when you transition on-prem, it's an easy transition. You don't have that vendor lock-in.
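As one concrete example of that approach: a pipeline written for Apache Airflow, an open-source workflow engine, runs the same whether Airflow is hosted on a cloud VM or in your own data center. A minimal sketch, with illustrative task names and logic:

```python
# Minimal Apache Airflow DAG: a cloud-portable alternative to managed
# pipeline services. Task names and logic are illustrative only.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from object storage or a database")

def enrich():
    print("clean and label the data before model training")

with DAG(
    dag_id="ai_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    enrich_task = PythonOperator(task_id="enrich", python_callable=enrich)
    extract_task >> enrich_task  # run extract first, then enrich
```

Nothing in that definition names a cloud provider, which is the point: the same DAG file moves with you.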
Kirstin Burke:
Well, I think no matter what space you're looking at within IT, security, storage, whatever, that technical debt is what kills everybody, right? So the organizations that are able to, to your point, really think up front about those investments, and about where flexibility up front will help them maintain flexibility on the back end, come out ahead. Asking which things we might anticipate switching out, and which might become outdated quickly and therefore deserve more flexibility, is really important.
Shahin Pirooz:
Exactly, 100%.
Kirstin Burke:
What would you say, beyond the up-front planning and strategy, to someone who's thinking about this, who maybe is in the early stages of planning, or who has already made some investments in where they're going with AI? As someone who talks about this all the time and helps customers with this all the time, what are two or three things you might leave people with to think about or do as we wrap this up?
Shahin Pirooz:
So I would say, let's go back to the notion of what has stayed the same and what has changed. What has stayed the same is accountability and responsibility. What has changed, in terms of people, is that the role of the enterprise architect has become more complex. An individual enterprise architect, unless you have a team of specialized enterprise architects like the old days, has to think about all the moving parts associated with enhanced security, performance optimization, cost predictability, and hybrid flexibility. Somebody who has to think about all those moving parts has to be an expert in compute, storage, memory, network, cloud, the cloud-specific microservices, scale-out compute capabilities, CI/CD, DevOps; all of those factors come into play. And it's really difficult to find one person who's an expert in all of those things.
That's where companies like DataEndure can really help, because we bring in the resources as required along the way and can pull together a team that you can tap into fractionally when you're having these dialogues, a team that enhances and enriches your internal conversations with your own architects, gives you things to think about, and gives you guidance on what we've seen, what we think, and how we feel scale-out really works.
As for the two or three things to think about: the three factors I set up front, data security, performance optimization and customization, and cost predictability, are the things to really pay attention to when you're deciding on a direction for these applications. But as for how you make those decisions, understanding that it's difficult for any one person on your team to cover all those moving parts is probably the biggest takeaway I would give you. I can tell you there's no way I could do it as a single individual, even though, personally, I've touched every one of these things across my career and have had to deal with them across my career. I wouldn't even do this for myself by myself.
I would bring in a team of people who understand what they're doing. And that's really the value proposition of somebody like a DataEndure: you don't have to hire six different architects or enterprise architects. You can tap into the ones you need when you need them, fractionally.
Kirstin Burke:
Got it. Well, thank you so much for your time. These sessions are always so helpful, even for me, and I'm with you all the time; I always learn something, and I expect our viewers have as well. As always, if there's a takeaway here that any of you want to explore, or just want to run by us, if you're thinking about doing X, Y, and Z and wondering what we think, we encourage you to reach out to us, to Shahin. He can connect you with the best people on his team. We really would like to help folks get into a position of success early, rather than discover flaws or mistakes later, when they come back to bite. We would love to help you architect a success plan early and help you get the business outcomes you want. So thank you for joining us, Shahin, I appreciate your time, and we'll see everybody next month.