Kirstin Burke:
The theme of our TECH talk kind of plays around with Halloween as well. We are calling this Digital Ghosts and what the risks are of abandoned data. And before this call started, we were just talking a little bit about data and trying to visualize, trying to get our heads and hands around the size of data we’re talking about here. Like what is the challenge that we’re talking about?
Research out there says that this year alone, we as a world will be creating 147 zettabytes of data. So I didn’t know what a zettabyte was. Shahin did some research into what a zettabyte was, and the easiest way to put your head around it is one zettabyte is one billion terabytes of data. The mind can’t fathom how much data that is that we as people are creating every year. When you think about the implications that AI is starting to have on that data creation too, it’s just going to balloon. So when you think about data and when you think that 80-90% of data out there is unstructured, you start thinking about this massive pool of data and that it is not managed or maintained or secured in a way that is sustainable for organizations. And it’s a hotbed for cyber adversaries.
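To put the zettabyte figure above in concrete numbers, here is a small arithmetic sketch. It assumes decimal (SI) units, where a terabyte is 10^12 bytes and a zettabyte is 10^21 bytes; the 147 figure is the one quoted in the talk.

```python
# Decimal (SI) units: 1 TB = 10**12 bytes, 1 ZB = 10**21 bytes.
TERABYTE = 10**12
ZETTABYTE = 10**21

tb_per_zb = ZETTABYTE // TERABYTE   # terabytes in one zettabyte
yearly_zb = 147                     # figure quoted in the talk
yearly_tb = yearly_zb * tb_per_zb

print(f"1 ZB = {tb_per_zb:,} TB")            # 1 ZB = 1,000,000,000 TB
print(f"{yearly_zb} ZB = {yearly_tb:,} TB")  # 147 ZB = 147,000,000,000 TB
```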
And so we’re just going to get into this a little bit and talk about really what is this problem? What is unstructured data? Why is it a problem? How does it happen and what can we do about it? So I’m joined as always by Shahin Pirooz, who is our expert in all things security and technology. Hello, Shahin.
Shahin Pirooz:
Hi, everyone. How’s it going, Kirstin?
Kirstin Burke:
It’s going great. It’s going great. I’m just going to open it up to you and say, can you explain to us how should we think about abandoned data? Like what is it? What’s going on? How do we get here?
Shahin Pirooz:
So first let’s start with, how do we define abandoned? It’s, you know, it’s not like we left it on the side of the road and drove away. It’s in the industry commonly called ROT data, which stands for redundant, obsolete, or trivial data. So how would something be redundant? If we have it in other places and we’ve made copies of it. It becomes obsolete when we no longer need it, and no longer need it comes from a lot of things. For example, it might be that we have a regulatory reason to keep something for, let’s say, three years. And after that three year period, there is really no need to keep it. And if there’s no business need and no regulatory need to keep it, it is obsolete data. And then trivial data is something that really adds no value. It’s just data that we’ve collected over time. The other thing that falls into that same category is forgotten data, stuff that we just kept piling somewhere and it can fit in any of those categories. Or it could be important data and it could be data that is sensitive and risky data. So the concept of digital ghosts is really this concept of not knowing what we have and where it is. That’s the key.
Kirstin Burke:
So how did we get to this place where 80-90% of our data falls into this category, right? I mean, we set up procedures, we set up processes, we set up protocols, we buy tools and tools and tools and tools to manage our data. And yet we have a significant amount of ROT. What is it that prevents us from knowing more about this data or managing it differently?
Shahin Pirooz:
The easiest way to wrap our heads around what happened is we’re digital hoarders. It’s just like when you pile the old pictures and papers and everything in boxes and put them in the garage and then they follow you to your next three moves and you never open those boxes, but the boxes just get on the moving truck and go with you. We’ve done the same thing with digital data. We get to this point where we start to be concerned that we might need it someday. There might be some designs, plans, contractual language, or, you know, some invention we created that we might need to recreate and we don’t want to forget it. Any number of things could lead us to believe I can’t lose this data.
And it starts small. It starts with, you know, design diagrams. It starts with contracts. And pretty soon it’s everything. And, you know, it’s now take that and propagate it across however many employees you have, and every one of them is doing the same thing. And they’re probably keeping their own copies of the same data, which is where the redundant comes in. So not only do you have a central location file server where people, departments have shared data, but some people are keeping copies of that on their own system. Or they’re making copies in another folder so they remember where it is and can access it easily. Or they’re afraid the company policy is to delete it and they want to save it for themselves. So there’s a lot of factors that get us to where we are. And there’s not a lot of time and energy we’ve spent in trying to fix this problem other than archiving and backup. And so we use, to your point, we keep using technology to try to solve the problem. And I think that technology is a helpful tool, but there’s other things that need to happen in order for it to properly be managed, so it doesn’t get out of control.
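As an illustration of the "redundant" part of ROT described above, here is a minimal sketch of how copies of the same file could be found by comparing content hashes. The function names `file_digest` and `find_duplicates` are hypothetical, and this is only one possible approach, not a description of any particular product.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only digests seen more than once represent redundant copies.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("/some/share"))` on a hypothetical share would return one list per set of identical files. It only catches exact copies; renamed files are found, but edited copies are not.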
Kirstin Burke:
Alright, so, you cued yourself up on that one. You know, let’s talk about what are some steps that organizations can take? I mean, I’m hearing you talk about the causes, right? And I think about even simple things. For example, DataEndure, we’re forty years old, right, this year. We have not had the same employees. We have many of the same employees, but we have not had the same employees over the course of forty years. So even those employees that come and go, you know they all have their hoards of data and then they go and they leave the data, and that data compounds. So there are all sorts of different things, to your point, all sorts of different ways that we contribute personally as well as operationally, organizationally to this issue.
Shahin Pirooz:
There’s also one additional factor which is, you know, it’s marketecture. It’s the global marketing or the ecosystem saying, you know, starting with the Yahoos and Googles of the world, that data is where all the value is these days. You have to have your data and then, as we progressed through this, we went into data lakes, business intelligence. Fast forward to today and now we’re talking about augmented intelligence trying to help us to process that data and get meaningful nuggets out of it. So all of the piles of data that we’ve kept, we now start thinking about it in the context of, is there meaningful gold in there? And do I want to go mine those hills, literally? But the fact is, there isn’t a ton of meaningful data in there. The meaningful data for training our large language models and building RAG [retrieval-augmented generation] around them is really things that we specifically care about, so the data that is helpful for training comes from more structured data, from databases. For example, if you want to create a help desk bot, you go to your ticketing system and look at the tickets that you have and then look at the responses that you have and then you put standard operating procedures in for training. But you don’t have, you know, a petabyte of standard operating procedures. You have two terabytes or less. So I think that’s what exacerbated this whole thing: we did pile and hoard, and then all of a sudden we said maybe it’s actually meaningful data under those hills. And those digital hills just keep getting bigger and bigger and bigger and we don’t delete anything.
So coming back to the question you asked, I think there’s really only five areas that people need to focus on and it starts with knowing and understanding your data. So, you’ve got to audit and clean your data on a regular basis. You’ve got to get to a point where you know what data you have, where it is, and you get rid of it when it’s outdated or not relevant to your business. And that’s not occasionally, or on some haphazard schedule. You have to set schedules to say, for example, every quarter, I’m going to do a cleanse, and we’re going to get rid of anything that’s older than, let’s just pick a number, one year, two years, three years, you pick it. And if you’re doing that on a regular basis, it creates a culture where the data that’s important is maintained and touched. And if it hasn’t been touched in, you know, thirty-six months, is it ever really going to be useful? Maybe in some one-off chance. But is that something that can be recreated or not? The second thing is to set clear retention policies. So I just talked about three different retention policies, one year, two year, three year. But setting a clear retention policy that is based on both legal and business needs. And legal might be regulatory or it might be contractual, tied to what you’ve committed to your clients that you will keep the data. So maintaining a policy that is super clear so that your archiving and retention and deletion systems can be set to match that retention policy is really important.
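The scheduled cleanse described above could start from a simple report of files that have not been modified within the chosen window. The sketch below is only an illustration: `find_stale_files` is a hypothetical helper, and it uses last-modified time as a stand-in for "touched", since last-access times are often not recorded reliably on file systems.

```python
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 86_400


def find_stale_files(
    root: Path, max_age_days: int, now: Optional[float] = None
) -> list[Path]:
    """List files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

A quarterly job could call this with `max_age_days` set to 365, 730, or 1095 to match the one, two, or three year examples, and hand the resulting list to whoever approves deletion or archiving rather than deleting directly.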
Kirstin Burke:
My question around this is, these retention policies and the schedules that you put in place, obviously if it was just up to IT, right, that’s easy. You know, you click a button, you schedule it, it’s done. But you have these pesky employees, right? And whether it’s, you know, no matter what level of the business you are, different people value different data, right? Oh, this is really important to me. This is really important to me. And now all of a sudden you start putting tension on that with, okay, now we have a schedule. How do you manage the data need and the employee variables?
Shahin Pirooz:
That’s a great question, and it’s the fifth thing I was going to talk about, so we’ll just jump to that one. No, it’s okay. The order isn’t really important, but it’s just like anything. When we talk about security, we talk about security awareness. When we talk about safety, we talk about awareness of where the exits are and fire alarms and all that. The exact same thing has to be true for data. We have to create this awareness around data hygiene and not only what the best practices are, but what the policies and procedures are that we as a company are implementing and why. And the why is really trying to get a handle on forgotten, redundant, obsolete, trivial data.
And it’s all about reducing risk because as much as we think data is gold, data is also gold to the hackers. So when they ransomware an environment, what they’ve done is they’ve exfiltrated data from your environment and are holding that hostage until you pay them the ransom. And if you don’t pay them the ransom, they put it out on the internet and let other people get access to it. And if you have customer data or sensitive data in there, now your reputation is impacted as a company. So this is much bigger than Joe really needs to hang on to his data. Does he really, and is there risk associated with hanging on to it? And, you know, when I mentioned earlier that it needs to align with legal and business needs, people often say, I have to keep this data legally. The question is, do you really? Because when we go back to the FRCP, which is the Federal Rules of Civil Procedure, they say that your data retention has to match your data policies, and your policies have to be written at least six months before a litigation. So you can’t, after an event happens, change your policy to ninety days and delete all the old data.
And I can tell you that I’ve worked with clients that have ninety day retention policies in email, meaning they delete everything after ninety days, whether the employee likes it or not. And I remember my first reaction to that was, wow, that seems very aggressive. And they said at first it was and everybody reacted as you just did, but probably more viscerally. But people adapted and people learned, and they realized they don’t miss that email and they rarely went back to look for something anyway. And they had implemented archiving tools for that data that was important for retention, for legal reasons, for litigation. And they were able to put legal holds and so on and so forth if something needed to stay longer than ninety days. So I think that coming back to the comment you made, just like security awareness training, we want to do data hygiene training and educate our employees not only on what is ROT and why it’s a bad idea to make fifteen copies of the same file, but also the risk associated with that when IT and security don’t know where that data is and the bad actors compromise their system that has a lot of proprietary information on it.
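The ninety day policy with legal holds described above can be expressed as a small rule. This is a minimal sketch under stated assumptions: the `Record` type, its `legal_hold` flag, and `is_expired` are hypothetical names for illustration, not the interface of any real archiving tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    name: str
    created: date
    legal_hold: bool = False  # hypothetical flag set by legal or compliance


def is_expired(record: Record, retention_days: int, today: date) -> bool:
    """A record may be deleted once it is past retention and not on hold."""
    if record.legal_hold:
        return False  # a legal hold always overrides the retention clock
    return today - record.created > timedelta(days=retention_days)


today = date(2024, 10, 31)
old_mail = Record("q2-newsletter", date(2024, 7, 1))
held_mail = Record("contract-dispute", date(2024, 7, 1), legal_hold=True)

print(is_expired(old_mail, 90, today))   # True: 122 days old, no hold
print(is_expired(held_mail, 90, today))  # False: on legal hold
```

Keeping the hold check separate from the age check means the retention number can change with policy while holds keep working the same way.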
So, kind of coming back to this from the top, audit and clean your data regularly, set clear retention policies, and adhere to those policies that are used for cleaning the data. So those things kind of are like a circular loop. And then the next thing is, you know, we keep talking about zero trust these days and we keep talking about explicit access versus implicit access. And that should also be true for files or data. Limit access to who should have the data, and specifically if it’s sensitive data, and continuously review permissions. There’s a lot of tools out there that will look at Active Directory and evaluate who has access to what file systems.
You know, the most common problem with ROT is, Kirstin, you might be working in finance and then you decide marketing is your joy and you go move over to marketing, but you still have access to all the financial data because nobody went back and cleaned up your access to the files for finance. So given that you were in finance, there’s probably some sense of comfort and trust in that they hired you in that position. So you probably already know all this data. But going forward, you shouldn’t have access to it. And when you bring security into the mix, all it takes is for your system now to get compromised, which might not be as controlled as the finance systems. And now they have access to the finance data, which is something that could be huge exposure if it gets exfiltrated and posted on the internet. So limiting access and regularly reviewing the permissions associated will help to reduce and ideally avoid exposure of data. Because access doesn’t always get cleaned up when people move around the company, and the larger the company gets, the more often that happens. So it becomes even more and more important to review that.
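The finance-to-marketing example above can be sketched as a comparison between who holds access and which department owns each share. Everything here is hypothetical (the `Grant` type, the share names, the department maps); a real review would pull this information from a directory service, and the output is meant as a queue for a person to review, not an automatic revocation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    user: str
    share: str  # hypothetical name of a file share or folder


def grants_to_review(
    grants: list[Grant],
    user_department: dict[str, str],
    share_department: dict[str, str],
) -> list[Grant]:
    """Return grants where the user's current department does not own the share."""
    flagged = []
    for g in grants:
        dept = user_department.get(g.user)
        # Unknown users are flagged too, since nobody can vouch for them.
        if dept is None or dept != share_department.get(g.share):
            flagged.append(g)
    return flagged


# Kirstin moved from finance to marketing but kept her old access.
grants = [
    Grant("kirstin", "finance-reports"),
    Grant("kirstin", "marketing-assets"),
    Grant("joe", "finance-reports"),
]
user_department = {"kirstin": "marketing", "joe": "finance"}
share_department = {"finance-reports": "finance", "marketing-assets": "marketing"}

for g in grants_to_review(grants, user_department, share_department):
    print(f"review: {g.user} -> {g.share}")  # review: kirstin -> finance-reports
```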
And again, all these things, like for example, the cleaning of the data, the regular deletion or archiving of data, all of those things, including limiting access and review of access, need to be done on a consistent schedule. They need to keep happening regularly. They can’t just be, you know, oh yeah we need to do that data thing. You’ve got to set not just the policy but the procedure that says here’s how we’re going to review, when we’re going to review, and what we do post review with that information.
And then, coming back to the tools topic, it’s you’ve got to use data protection tools in order to protect your data. There’s encryption, DLP [data loss prevention] solutions, backup solutions, and archiving solutions. And backup and archiving should not be confused because sometimes we move data to an online archive in order to put it on cheaper disk, to put it someplace where it’s not accessible.
And there’s two key factors to pay attention to in the archiving space. Number one, separate your sensitive data from your regular data and set your policies for archiving retention so that you can delete the data, auto-expire it. But number two, encrypt that sensitive data. Don’t assume just because it’s offsite and it’s in AWS, for example, in Glacier, that it’s protected. Encrypt it so that even if it gets compromised, the bad actor can’t do anything with it. DLP has more and more become a complex platform as our data grows. It takes longer and longer to process the data. When we were talking about terabytes of data and not petabytes of data in an organization, DLP could process all that unstructured data and determine what it was, and where it was, and do it with some level of efficiency. Now it takes so much effort, and so much time, and so much tagging, and applying meta tags to the documents to say if they’re sensitive or they’re proprietary or they’re public or whatever.
In my opinion, DLP solutions need to be looked at very carefully. The traditional DLP models went and scanned the entire network for files, brought the metadata into a database, and then applied policies, and those policies were monitored with regards to the movement of data and the access of data. Those kinds of things, I think, are broken in today’s world. Many times DLP was considered data leak prevention: don’t let the data leak out because somebody copied it to a USB or email or whatever. Whereas I think today what we should be looking at, and the way we approach it ourselves, is we scan and analyze the data and map it to regulatory concerns and policies. And that helps us understand what the risk is if the data gets compromised. But then we can take that risky or sensitive data and encrypt it with the push of a button. And so even if it does leak, it becomes useless unless you have a system with the proper tool on it to be able to see that data. So I think we need to adapt more.
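To make the idea of scanning data and mapping it to risk categories more concrete, here is a deliberately simplified sketch using regular expressions. The pattern names and the patterns themselves are illustrative assumptions; real classification needs far more than pattern matching (validation, context, file format handling), and this is not how any particular DLP product works.

```python
import re

# Illustrative patterns only; they will produce false positives and misses.
PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_like": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify(text: str) -> dict[str, int]:
    """Count matches per category; categories with no hits are omitted."""
    counts = {name: len(p.findall(text)) for name, p in PATTERNS.items()}
    return {name: n for name, n in counts.items() if n}


sample = "Contact jane@example.com, SSN 123-45-6789"
print(classify(sample))  # {'ssn_like': 1, 'email': 1}
```

A file with any hits could then be tagged as sensitive and routed to the encryption and stricter retention handling described above.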
Encryption has been one of those things where it’s become a cyber insurance policy that says, make sure your hard drive is encrypted. That’s great. But the minute something comes off your hard drive, it’s unencrypted. Whereas what I’m referring to is that data is protected and encrypted, especially if it moves off the system. Because you need to be able to have some level of comfort that if the bad actor takes your data, they can’t do anything with it. And in a previous talk, we talked about how you also need to be quantum ready with your encryption because they’ll take it and sit on it until quantum computers are capable and cheap enough that they can use them to decrypt your files.
Kirstin Burke:
So this is a heavy lift. And probably depending on how astute you’ve been up until now, managing your data, right? Some people might say, oh my gosh, I don’t even know how to get started. Some people might say, okay, I’ve got a couple tips here that I can go in and apply and it’s not as big of a deal. And when I hear you talking about all these tools, it’s interesting whether it’s backup or storage or whatever, it’s like these tools have been here to help us scale, yet they have also enabled us to have bad hygiene for years and years and years because like, okay, so I just buy more storage or I just continue to do more of what I’m doing and buy more to accommodate this growth of data rather than, to your point, kind of taking that hard line and picking what it is and how it is and starting to do things different.
How does someone, you know, just in terms of analyzing what you’ve got and figuring that out, how does someone assess or maybe grade themselves? How am I doing today, right? What do I need to do different? How would someone get started to figure out what level of ROT they have and what they need to do to put themselves in a more secure position?
Shahin Pirooz:
There’s a couple ways. One of them is the risk approach which I was hinting towards and the way we do DLP, which is understanding what data, so the key thing to start any of this is, know what you have and where it is. So knowledge is golden as in any scenario. The next thing is, in parallel, go and start figuring out what your legal constraints and regulatory constraints are. How long do you have to keep data? Like, really, what’s the real thing? And talk to your legal counsel and to your compliance people to get that answer. And then implementing the tools, then we come back to this notion of there’s plenty of tools out there that will scan your file systems for access levels. So that’s one side of it, restricting access and getting explicit access into the environment. Who has access to what files and should they? And then being able to clean that up on a regular basis.
The second is the data protection itself. So the risk based approach says we’re going to look at what data we have, where it is, and what risk level it is associated specifically with ransomware, which is what is top of mind for most of us. Then the other side of it is the archiving of data. So we’ve got all this ROT, we probably should move it off of primary expensive storage since it’s so much data, and move it to an archive environment, and then set a policy that expires it from archiving based on the policies we set a few minutes ago. So then you look at what we call dark data assessments, which is looking at your data, analyzing what is, from an access basis, so how often is it being accessed? When’s the last time it was used? Is it not being touched? And if not, let’s, you know, it hasn’t been touched in six months, a year, three years, whatever time frame you choose for archiving, we move it off to the archive and make it still accessible. It can be brought back if somebody needs to use it and we could even do tiered archiving, which is move it to offsite storage like S3 that is still performant, then take the next layer and go to Glacier or equivalent and other files, so that it is archived and it’s not intended for access, it’s intended for offsite tape storage.
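The tiered archiving described above can be sketched as a rule that maps how long a file has been idle to a storage tier. The thresholds below reuse the six month, one year, and three year examples from the talk; the tier names and the `choose_tier` function are hypothetical, and a real policy would also honor legal holds and the retention rules agreed with legal and compliance.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; choose values that match your own policy.
WARM_AFTER = timedelta(days=180)        # cheaper but still performant storage
COLD_AFTER = timedelta(days=365)        # deep archive, not meant for access
EXPIRE_AFTER = timedelta(days=3 * 365)  # delete per retention policy


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Pick a storage tier from how long the file has gone untouched."""
    idle = now - last_access
    if idle >= EXPIRE_AFTER:
        return "expire"
    if idle >= COLD_AFTER:
        return "cold-archive"
    if idle >= WARM_AFTER:
        return "warm-archive"
    return "primary"


now = datetime(2024, 10, 31)
print(choose_tier(datetime(2024, 10, 1), now))  # primary
print(choose_tier(datetime(2024, 1, 1), now))   # warm-archive
print(choose_tier(datetime(2023, 1, 1), now))   # cold-archive
print(choose_tier(datetime(2020, 1, 1), now))   # expire
```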
So you got the security side of it, which is let’s take a look at our data risk. You got the ROT side of it, which is let’s look at our redundant, obsolete, and trivial data and then archive it. And then you got the policies which tell us when we can get rid of it and not have to hold on to it anymore. And implementing all of those things as well as who has access to it. So implementing those four things, there’s really three or four technologies I just described that can help close that gap, and in many ways automate what we’re talking about. The one that is the most complex, I would say, is the access because somebody has to review the access. It’s not something that can be automatically done, but somebody has to review the access to files and folders and directories and then change those based on, you know, if you’re in a department that shouldn’t have access to this file, let’s remove your access.
Kirstin Burke:
What is your experience when organizations are getting serious about tackling this? Where is it that you find people need help? So a lot of this, maybe they feel like their teams either have the tools in place or the processes or the experience, like we can go tackle that. Where is it that you find organizations raise their hand and say, you know what, this isn’t our thing?
Shahin Pirooz:
I’m going to give you a corollary, and it’s very rare that somebody says this isn’t our thing. It’s usually when somebody’s overwhelmed and they’ve been asking for information about it that they’ll say we don’t have the right tools or resources to get the data we need. And the data we need is what is our data risk, for example. That could be a simple question. So that’s when we get pulled in. But I’ll give you a corollary to what seems like a very simple thing.
So when I talk to customers about email security, every single customer, every single one to a T says we’ve got email protection. And when I dig in deeper and find out what it is, usually it’s gateway-based email security and simulated phishing attacks to train the user. So security awareness and gateway security. It would be like having an antivirus solution on your desktop and saying, I’m protected in today’s world. And similarly with data, when you talk to someone and say, I want to talk to you about your data and your data risk, the first thing is, the word risk usually, you know, sparks something and people get nervous. So it’s one of those fear, uncertainty, and doubt things. And they don’t really understand their data risk. But when you talk about data protection and archiving of redundant data, the reaction is we back up our data. We’re good. And it’s a very similar thing.
Moving this data into backups and keeping those backups forever is the same problem. Or if you have it just on tape and you hope that those tapes work when you need to bring it back, that’s another concern area. So having a good understanding of what your data protection is, what your data security is, and making a business decision based on cost, risk, whatever metric you’re going to use in your company. For every company, it’s different. Don’t assume just because you’re doing something today that it is protecting you in all the areas I just talked about. Because all it takes is that you are backing up your data, but you haven’t moved any of the sensitive data into any secure place, you haven’t encrypted it, and the first time a bad actor gets in your network and takes your data, that sensitive data is now held hostage.
Kirstin Burke:
Right. So it’s rare to find someone out there who’s doing nothing. Most people are doing something. And so it’s really peeling those layers away and understanding of the solution that you have in place, is it the right strategy to deliver the outcomes that you’re hoping to achieve with the data?
Shahin Pirooz:
Exactly. And that’s a unique journey for each company and what they value as important. And we’re not here to be the data police or the cybersecurity police. We’re here to help. All the things that we put together, all the practices and assessments that we put together, have always been around this notion of let’s get visibility, not to call someone out and say you did something wrong. You identify what the gaps are so we can help you close them.
Kirstin Burke:
Right. Well, and I think you mentioned it, but it’s important to note for anybody listening, we do have a number of different assessments that are complimentary that really, to Shahin’s point, help illuminate and just give you a high-level perspective of what have you got, what gaps might you have, what issues might we want to solve in a very collaborative fashion, where this is not to get anybody in trouble, but it’s really to help you further protect your business, your data, your operations, things like that.
So for anybody listening who hears all of these five areas that we’ve gone through and is thinking, gosh, you know, I’m not sure in these two if I’m okay. We would love to talk to you. We’ve got a ton of expertise. We’ve got a lot of experience doing this. And, to Shahin’s point, everyone’s journey is different. And we have walked this path with hundreds of different clients, and so really can help you understand where it is you might need to go. So Shahin, thank you. You don’t have to be afraid of the digital ghost. We can help you peek behind those corners and turn the lights on and make sure that we get you in better shape to make sure that you mitigate that risk. Thank you, everyone.