May 2024 Tech Talk - Scale Smart: Modern Storage for AI Workloads - DataEndure

Shahin Pirooz:

I’m joined today by Bjorn Kolbeck, the CEO of Quobyte. Bjorn, could you once again give us a little background about Quobyte and what its mission is?

Bjorn Kolbeck:

Yeah, it’s kind of funny that we had a hardware issue, because when I present Quobyte we run on commodity hardware, and one of the things I always say is hardware is inherently unreliable and you have to make the software take care of that. So Quobyte is a storage system focusing on scale out high performance applications including machine learning. And we run on commodity hardware. So standard servers, no appliances, and you can basically start at four servers and grow to hundreds of servers in one cluster to deliver scale out high performance, high throughput.

Shahin Pirooz:

And one of the typical challenges we talked about a little bit ago, and I’m going to ask you to go through it again since our audience couldn’t hear, is the journey somebody takes when they’re trying to go down this path of building a machine learning platform, model, consume AI, whatever the path they want to take to get machine learning into their environment. There’s always a science project that starts out, potentially in one of the hyperscalers, and there’s some point where it becomes untenable to maintain because of cost, because of many factors, that platform in the cloud. Talk to us about what that journey looks like with Quobyte.

Bjorn Kolbeck:

Yeah, journey is probably the right word because it usually doesn’t start as a planned project where someone says, I need to build large scale AI infrastructure. That’s usually the last step. It starts with someone, data scientists, starting experiments on smaller scale data. You know, you start on your local machine that has a few GPUs, maybe you have a DGX, too, or you do it in the cloud where you can get the resources on request. But the problem is always the same. It starts small and then it grows. And if you’re on the cloud, you quickly realize how expensive it is. You pay for someone else’s computers, pay for the flexibility, and then you have the data component that’s very expensive that people always underestimate.

Shahin Pirooz:

Is there an easy thumb in the air meter that tells you when you get to X scale, it’s probably time to consider bringing it back in house?

Bjorn Kolbeck:

So this is totally unscientific, but in my experience, the inflection point is around half a petabyte, where it starts to get really expensive on the cloud. And also, you know, as you go into the petabytes, you have the data gravity, and you have the problem of the egress fees. It’s still $45,000 for a petabyte just in transfer costs.
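
As a rough illustration of the figure quoted above, the arithmetic can be sketched in a few lines of Python. The $45,000 per petabyte value is taken from the conversation; the function names and the decimal definition of a petabyte (1,000,000 GB) are assumptions made for this sketch, and actual cloud pricing varies by provider, region, and volume tier.

```python
# Back-of-the-envelope egress estimate. The per-petabyte figure comes from
# the talk; everything else here is an illustrative assumption.
USD_PER_PETABYTE = 45_000      # figure quoted in the conversation
GB_PER_PETABYTE = 1_000_000    # assumption: decimal units (1 PB = 10^6 GB)


def egress_cost_usd(petabytes: float, usd_per_pb: float = USD_PER_PETABYTE) -> float:
    """Estimated one-time cost of moving `petabytes` of data out of a cloud."""
    return petabytes * usd_per_pb


def implied_rate_per_gb(usd_per_pb: float = USD_PER_PETABYTE) -> float:
    """Per-gigabyte rate implied by the per-petabyte figure."""
    return usd_per_pb / GB_PER_PETABYTE


if __name__ == "__main__":
    for pb in (0.5, 1, 10):
        print(f"{pb} PB -> ${egress_cost_usd(pb):,.0f}")
    print(f"implied rate: ${implied_rate_per_gb():.3f} per GB")
```

At this rate, the half-petabyte inflection point mentioned above would already cost about half the quoted figure to move, and the cost grows in direct proportion to the data set.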

Shahin Pirooz:

Yes. I think that’s just about right. So for the audience that hasn’t started embarking on this journey, and they’re thinking about doing this, how quickly does one of these projects get to a petabyte?

Bjorn Kolbeck:

It depends on what kind of data you have. If you just do small scale natural language processing, text isn’t a lot of data, but as soon as it gets into images, if you deal with images, videos, you quickly go into the tens of petabytes. So think of autonomous driving, you have cars on the road. They produce so much data from the video and other sensors that is very valuable, so.

Shahin Pirooz:

Or facial recognition or any of those, yes.

Bjorn Kolbeck:

Yeah. Quality control, you know, the amount of images that those machines or systems take in the factory. So it’s in surprising industries. I learned about the food industry, for example. Basically they need to keep the videos around forever in case there’s a recall. So suddenly you deal with a lot of data over years in industries where you wouldn’t expect that. So I think it just keeps building. Yeah, you have to keep it around. So it’s just a lot of data and it’s growing.

Shahin Pirooz:

You mentioned life sciences is one of the areas that Quobyte does well in. A couple years ago I spent a lot of time with these developers that were building a virtual employee, let’s call it a lab assistant, basically it was, they were in the life sciences space, but these were monitoring the cages that the mice are in when they’re evaluating. There’s, you know, data centers worth of mice and they’re each in little cages and they have to check the water, they have to check the food, they have to make sure their bed isn’t wet. And the way it used to be is somebody would walk around and stick their hand in the hay and make sure it wasn’t. So they went down the path of sensors and cameras, and it was water moisture sensors, it was humidity sensors, it was sound sensors. And that was the problem we ran into very quickly was the data just exploded with images and with all the sound files, if you will.

Bjorn Kolbeck:

Yeah. And I think that’s really the, if you look across industries and life sciences is a very good example, the amount of data that is produced from images is just astounding. You have light microscopes that are high resolution. They produce terabytes a day. You have electron microscopes that are basically half automatic, so they run 24/7, you have MRIs that are used for research. So a lot of machines that generate a lot of high resolution images, and then you need to make sense of that. And part of it is machine learning. Everyone tries it, but it’s a good example. So one of our customers is Siemens Healthineers, and they use that to train their models. And I think they have the most models in their machines in production. So when you go into an MRI these days, there are machine learning models that actually analyze the images to find problems with you.

Shahin Pirooz:

Right.

Bjorn Kolbeck:

So it’s, it is there in the real world. It’s pretty impressive what happens then. It’s all data and lots of images and machine learning.

Shahin Pirooz:

And having the ability to scale out on commodity hardware as needed is pretty important to being able to scale out when your data is growing out of control like that.

Bjorn Kolbeck:

Yeah. And it’s, you know, just having the data is not enough. You also want to use it.

Shahin Pirooz:

Performance.

Bjorn Kolbeck:

Exactly, that’s where the performance comes in. And with machine learning you might start small on a single box, but in the end, if you really want to solve large problems, you have to scale out. It’s the old story. We started with supercomputers and then we got into the scale out world and HPC with clusters which make HPC accessible to almost anyone. And I think we see the same thing with machine learning now. A lot of problems are being parallelized. So you can run them on many GPUs across many nodes, and that’s how you can solve very big problems. If you –

Shahin Pirooz:

The storage needs to be in phase with that.

Bjorn Kolbeck:

Exactly. The storage needs to be able to grow and, you know, feeding one GPU is something enterprise, traditional enterprise storage can do. But what if you have 100? Your storage system needs to basically scale the bandwidth with the number of GPUs that you have. I think that’s where, or one of the problems when you start the experiment, you do it with the stuff you have in your data center, it works. And then you keep on adding, and then you go to 20 GPUs and suddenly the storage falls over and then it’s like, oh, what do we do now? We need new infrastructure. And that’s where –
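
The point about bandwidth needing to scale with the number of GPUs can be sketched as a simple capacity check. This is a minimal model, not a description of any particular product: the per-GPU read rate (2 GB/s) and the storage system's aggregate throughput (30 GB/s) are hypothetical numbers chosen only so that the system copes with a handful of GPUs and falls short at 20, as in the scenario described above.

```python
# Minimal capacity check: does aggregate storage throughput keep up with
# the read demand of a growing GPU fleet? All numbers are hypothetical.


def required_read_gbps(num_gpus: int, per_gpu_gbps: float) -> float:
    """Aggregate read bandwidth needed to keep every GPU fed (GB/s)."""
    return num_gpus * per_gpu_gbps


def storage_keeps_up(num_gpus: int, per_gpu_gbps: float, storage_gbps: float) -> bool:
    """True if the storage system can deliver the required bandwidth."""
    return storage_gbps >= required_read_gbps(num_gpus, per_gpu_gbps)


if __name__ == "__main__":
    PER_GPU_GBPS = 2.0    # assumption: data each GPU consumes while training
    STORAGE_GBPS = 30.0   # assumption: what the existing storage can deliver
    for gpus in (1, 8, 20, 100):
        need = required_read_gbps(gpus, PER_GPU_GBPS)
        ok = storage_keeps_up(gpus, PER_GPU_GBPS, STORAGE_GBPS)
        print(f"{gpus:>3} GPUs need {need:6.1f} GB/s -> {'ok' if ok else 'storage is the bottleneck'}")
```

With a fixed-throughput storage system, the demand line eventually crosses the supply line no matter how generous the initial headroom is, which is why the storage has to grow along with the compute.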

Shahin Pirooz:

Huge investments.

Bjorn Kolbeck:

Yeah, they don’t have to be huge initially, but you need to think about something you need to change. And that’s sometimes a challenge for the IT teams because they’re not used to these workloads. They use databases, web servers, traditional office workloads, and suddenly there’s someone who wants to do machine learning.

Shahin Pirooz:

They’re very monolithic single server workloads.

Bjorn Kolbeck:

Yeah. And you need a different infrastructure for that. Unless you want everything to go to the cloud and lose control, and basically pay that, or you need to put something on prem.

Shahin Pirooz:

So there were a lot of experiments in the early days of cloud with scale out storage, from Ceph offerings to OpenStack trying to do things. And we constantly had performance issues with those scale out models. How has Quobyte made it so that you can use commodity hardware, but still maintain performance as you grow?

Bjorn Kolbeck:

In the end, that’s our secret sauce. So how do you build a system that can scale linearly, so that when you go from, let’s say you have ten nodes, ten storage nodes, and you go to 20, you get twice the performance. If you go from 100 to 200, you still get twice the performance. And part of it is how we do the redundancy and reliability across the machines. That’s really where the scalability comes in. Part of it is we’re not using NFS. This is where traditional enterprise IT, they get big ears and they’re like, what? No NFS? It’s part of the problem. It was designed for talking to a single server.

Shahin Pirooz:

And files.

Bjorn Kolbeck:

And files. Now we’re talking about 1000 machines with GPUs talking to let’s say 100 storage servers. If you have NFS gateways in between, you’re basically creating bottlenecks. What you need is clients directly talking to servers, so that you basically can use the full bandwidth of the servers and the network.
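
The gateway bottleneck described above can be illustrated with a simple model in which end-to-end throughput is limited by the narrowest tier in the data path. The client and server counts (1000 and 100) come from the conversation; the number of gateways and all per-node bandwidth figures are hypothetical, and the model ignores protocol overhead, network topology, and redundancy traffic.

```python
# Simplified throughput model: the slowest tier in the data path sets the
# ceiling. Client/server counts are from the talk; bandwidths and the
# gateway count are hypothetical.


def tier_gbps(nodes: int, per_node_gbps: float) -> float:
    """Aggregate bandwidth of one tier (GB/s)."""
    return nodes * per_node_gbps


def direct_throughput(clients: int, client_gbps: float,
                      servers: int, server_gbps: float) -> float:
    """Clients talk straight to the storage servers."""
    return min(tier_gbps(clients, client_gbps), tier_gbps(servers, server_gbps))


def gateway_throughput(clients: int, client_gbps: float,
                       gateways: int, gateway_gbps: float,
                       servers: int, server_gbps: float) -> float:
    """All traffic is funneled through a gateway tier in between."""
    return min(tier_gbps(clients, client_gbps),
               tier_gbps(gateways, gateway_gbps),
               tier_gbps(servers, server_gbps))


if __name__ == "__main__":
    direct = direct_throughput(1000, 1.0, 100, 10.0)
    gated = gateway_throughput(1000, 1.0, 4, 10.0, 100, 10.0)
    print(f"direct client-to-server: {direct:.0f} GB/s")
    print(f"through 4 gateways:      {gated:.0f} GB/s")
```

In this sketch the gateway tier caps throughput at a small fraction of what the clients and servers could exchange directly, and adding storage servers does nothing to lift that cap.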

Shahin Pirooz:

And you and I had a conversation a couple months ago about one of the challenges that we face when we’re speaking to customers: are we speaking to the right individuals who are embarking on this journey? You know, most of the time, like you said, it might be a DevOps person, it might be somebody who’s a data scientist. It might be – And we don’t often engage with the development teams first. And you said something very interesting to me, which is this is an opportunity for IT and engineering, technology engineering, to get ahead of the requests and requirements from their DevOps and data science folks. Can you talk a little bit about why and what your experiences have been to get there?

Bjorn Kolbeck:

Yeah, I think one of the reasons that the cloud has become so popular in companies is because, you know, it’s instant gratification. I need the resources now. I need a scale out solution now. I get it. So if the IT department can’t deliver anything at scale out, you know, traditional enterprise storage has a very different profile in terms of the workloads, also the cost. So you need to build infrastructure that can cater to this new workload. And it’s a bit like chicken and egg. If you don’t provide it, they’ll go somewhere else. So if you know that you have those kinds of projects, it makes sense to start building an infrastructure. And then you have to find a solution where you can start small and grow with demand. You don’t want to invest –

Shahin Pirooz:

Or nodes versus monolithic controllers.

Bjorn Kolbeck:

Yeah, and we can keep on adding with the demand because, as you said, it’s a journey. No one can project whether this will double in six months or maybe two months. So you need to be very agile as an IT department to provide that kind of storage.

Shahin Pirooz:

This is a corollary to when virtualization started to take effect. And we were, as IT people, trying to keep up with DevOps running to cloud because they could spin up a VM and get working and do what they need to do. They didn’t need to come to IT and fill out 700 forms to get a VM. And we had companies that popped up and basically provided the ability to do DevOps level automation at the virtualization stack level. Major companies did their own stuff. There were also third party hardware companies that embedded virtualization into their platform. This is almost like the virtualization of storage, and getting away from the traditional monolithic enterprise storage architectures that people are used to. So your own cloud storage at your fingertips.

Bjorn Kolbeck:

Yeah, and I would use the word storage as a service because cloud storage has a certain connotation. People often think of objects –

Shahin Pirooz:

Very slow. Not great for all workloads.

Bjorn Kolbeck:

Yeah. But what they actually want is the cloud-like experience where you provision something, you don’t fill out forms or requests and wait for it. So you click on it and then you get the resources and you can decide, I need fast storage for this project. And this is more like throughput work or it’s archival. So I think it’s that if you look at the bigger picture, it’s about building storage as a service. Not just one storage, but a storage system that can deliver scale out workloads with the storage they need. A web server can run on it. Some people have Kubernetes. So the idea of consolidating multiple workloads on one platform and then making it look to the users like they have their own storage system. That’s the goal.

Shahin Pirooz:

If I’m an enterprise and I’m building out this scale out architecture, at some point I’m going to ask the question of myself, can I put all of my workloads here? So file workloads. Do you support traditional, do you have an NFS gateway that front ends your data? Do you support traditional enterprise storage in addition to the scale out models we’re talking about?

Bjorn Kolbeck:

We do. And we do have NFS gateways. The reason that I’m always saying don’t use NFS is because it doesn’t do things like failover, but sometimes it’s just easier to use NFS. So we do support those traditional enterprise workloads. Part of our background is actually that we started in the OpenStack community with virtual machines. So I think that’s also a very good second step. But I think where the customers have the biggest benefit is if they use Quobyte for the scale out workloads, where the users really want the performance and scale out capacity.

Shahin Pirooz:

Yeah, so we talked for a second about actually more than a second, but we talked about starting small and scaling out, getting ahead of the curve with, if you’re in the engineering department in IT, and having this performant storage architecture as a service that you can scale out to whatever the company’s needs are as you grow. Four nodes and some licenses, pretty simple to get started.

Bjorn Kolbeck:

Yeah.

Shahin Pirooz:

What’s the best way for one of our customers who’s listening to evaluate Quobyte and say, this is something I want to look at? I’ve been waiting for something like this. How do I talk to you about it?

Bjorn Kolbeck:

I think there are two paths depending on what type of IT person or engineer you are. We have a free edition that you can just download and install. So that’s no questions asked. That’s the, I don’t want to talk to anyone. I would still suggest talking to us because we can help with selecting the right hardware. The beauty of software storage is that you can use high end, all NVMe servers or dense hard drive servers for different workloads and different capacity price points.

Shahin Pirooz:

And you can mix and match those nodes in the stack?

Bjorn Kolbeck:

Yeah. So that allows you, again, going to the storage as a service, you might need some high density storage where it’s really about capacity. Or you might need NVMe, highly parallel storage for the performance. Often you need a mix of both. And with software storage, you can easily do that.

Shahin Pirooz:

Perfect, so obviously I would be remiss not to say, DataEndure is here to help. If you guys are embarking on a journey like this, reach out to us. We’ve had a great relationship with Quobyte. Love to talk to you about how we can help you select the right hardware to make this happen for the workloads that you have. We have storage assessments, dark data assessments that we can do with whatever you’re using today and make recommendations. And then obviously you’ve heard us talk plenty of times about our security practice. And the last topic I’m going to touch on is let’s talk for a second about security of these types of workloads, specifically in this machine learning and AI world. Everybody’s concerned about ransomware.

Bjorn Kolbeck:

Exactly. The good thing is with machine learning workloads, a lot of data is write once. So you can just make it immutable. I think most storage systems should support that. We do. And then ransomware is not a concern to you. So that’s the easy part. I think where it gets really interesting is when you have sensitive data. So when you deal with, for example, medical data or something that could involve facial recognition, where you have recordings of people. So you might have legal requirements to secure your storage above what’s industry standard. You need things like end to end data encryption. When you talk about different customers, potentially you provide services to different external customers. You need multi-tenancy to isolate them. You need to think about proper access control for the data, logging, who has access to data, if people are identifiable. So those things you really need to take care of in a machine learning environment.

Shahin Pirooz:

And how much of that does Quobyte support?

Bjorn Kolbeck:

All of it, otherwise I wouldn’t mention it. But yeah –

Shahin Pirooz:

I just want to make sure for our audience they hear that. But yeah, I think one of the things that’s been very interesting for me in our conversations is, you know, data at rest, in transit, and at the source is encrypted in your stack, as well as the segmentation and being able to handle multi-tenancy and being able to slice and dice. So the storage as a service is not just a moniker, you’ve actually built it.

Bjorn Kolbeck:

Yeah. And it’s not just saying, okay, we have multi tenancy. It’s also about making sure that the security around it works, that the performance isolation works. So it’s many pieces together that then allow you to build storage as a service.

Shahin Pirooz:

Perfect. Well thank you very much Bjorn for joining us today. This has been hopefully an interesting conversation for all of our listeners. If you’d like to learn more about what Quobyte is doing and how DataEndure can help in that conversation, please don’t hesitate to reach out to us. We’d love to hear from you. This is an area where we’re investing a lot of time and money trying to help customers who are starting and embarking on this journey, or even those that have started a journey, and they’re halfway through it, or a quarter way through it, and they realize, holy cow, what we built isn’t scaling with us. There’s got to be a better way. And as we heard today, there is. So please reach out to us. Thank you.
