Presenter:
Transcript:
Hello, everybody. If you've been going through track nine, welcome. If you're new, welcome. We've got an awesome talk for you today called "Logs: Lifeblood or Biggest Problem" by James Cabe. A little bit about James: he's a veteran technologist and cybersecurity expert with roots in Oak Ridge, Tennessee. He began his career at BBN Planet in Cambridge, Massachusetts,
one of the internet's original backbones. He's provided network security consulting in New York for law firms, trading networks and global retailers. And he spent nearly a decade in Houston working in oil and gas operations and industrial control systems before moving fully into cybersecurity. So please welcome James Cabe.
Wow. That was a really good, really good introduction. It almost makes me sound like I'm a big guy. That's supposed to be a joke, right? So, yeah. I tried. So these are all going to fail pretty badly. So I know logs are the most scintillating topic that you're going to hear in this whole conference, I'm sure.
So, just to give everybody a little bit of hackery in this, you can go and pull up your phones and pop open your Wi-Fi real quick and see that nothing is safe here. So why am I talking about logs? Real quick, everybody: logs are not necessarily a hacker topic. It is really about the blue team and helping them fix some problems when it comes to, you know, keeping the bad guys away.
So that's kind of how it's tangential to this whole thing. It's been one of the biggest things. I did a talk at SAINTCON, which I highly suggest. It's very similar to HOU.SEC.CON, except they're, like I said, one of these badge conferences where you have to do a whole lot of soldering. Still a lot of fun.
It's out there in Utah. I highly suggest it; if y'all like this one, you'll like SAINTCON over in Utah. I gave a talk there a couple years ago, and I named the three biggest things I would like to kill in my own career. I'll feel accomplished if I do. The first one's passwords. They're already on the way out.
Yay! I worked for Microsoft for a little while. Everybody clap. No more passwords in the relatively near future. Maybe. We have the answer to it anyway, right? The second one is syslog. I hate syslog. Man, that is a 51-year-old protocol that just needs to go away. Right? And then number three, and this may get a couple of eggs thrown at me, especially by Microsoft people,
and I used to work for them: I would like to get the Windows operating system out of every bit of critical infrastructure on the planet. I even did a startup to do that. Nobody funded it. So, yeah, it was a brand new HMI that would only run on K3s, but nobody bought it.
So anyways, that's where I come from. That's kind of my background, and I've always been a technologist, somewhat of an inventor. So we're going to go through what I've been working on lately when it comes to the answer to logs, because I realized it was a huge issue in the industry. And you have startups like Cribl making $315 million in their E round, and they don't really, truly take care of the problem.
They take care of making it cheaper, but they don't cut any of it out or make it more usable or any of those other things. So my objective is to do all those things, and then show you guys a little bit of how to do it yourself. So in this room, how many people are leadership types who care about budgets?
Raise your hand, please. Okay. There will be some of this in here for you. Anybody that's technical and loves hacker stuff and codes and likes to play with Jupyter notebooks? All right. A lot of this will be for you too. So I'm going to try to make this thing pretty democratic, so to speak.
Actually, I like questions, especially during my talk. I'm not an evil scientist, nor a galactic overlord, so I do not like to monologue that much. So there will be times I ask questions, so please feel free to answer. I like the questions, so if you've got one, you can always raise your hand. In the meantime, I've been in sales.
I was with Fortinet for almost ten years, most of it in sales as opposed to product, because they realized, like, I could talk. They actually call me the Mouth of the South. Go figure. So they put me in sales and got me out of product, because I talk too much. So because of that, I find myself using sales terms.
So if you find me using any sales terms, it's a bingo. If you take a picture of this and mark out the bingo, and you get a bingo at the end of it, I will give away a prize at the end of the speech. Okay, I have a Backdoors & Breaches for OT card deck that I can give away, or, if you pull up your Wi-Fi and notice that you just got Rickrolled on your Wi-Fi,
I've also got an ESP32 Marauder, if you guys want one of those. So please play bingo if you want to. It should be fun. Plus, it makes you pay attention. All right. Anybody remember this cartoon? Okay. Sort of. Okay, that really dates me, because it was made originally in 1952. And it probably means I watched way too much VHF TV once upon a time, because that was usually played on Saturdays.
So, the bear hates noise, and he really hates noise, so he screams and yells at it all the time. I do too. I worked for one of the OT cyber companies, one of what they now call cyber-physical security systems. I worked for them. They got bought by Microsoft. Even at the time, all I heard was, man, this thing sure does make a lot of logs.
And it was one of the least log-making ones of them. And I realized they get stuck out in the field, and I had to do some incident responses on contract after I left Microsoft, for one of the big three companies. And, you know, they had their contractors do all the real hard work, and then they put the real pretty people in front of the client to present the real hard work.
I was one of the guys doing the hard work, doing the incident response and the write-up on it. Well, I found in almost every instance that they had one of these cyber-physical protection systems, and it had logged the issue on the pipeline before it actually went down and the billing systems got got, right? But nobody saw the logs.
They didn't go in anywhere central. They were stuck out on a sensor out in the middle of nowhere. Right? So there's a bunch of reasons that happens. So if we want this stuff to work, we actually have to make sure we actually see the stuff working. Right? And so the only thing we usually have to rely on is logs, which, again, I want to go away.
Right. So this is all the stuff that can make a whole lot of noise. Sorry for anybody that might see some of their brands in here. I love all these brands, otherwise I wouldn't put it on the screen. But, you know, we do tend to make noise. No matter what goes on. So this is your typical type of OT style network, with OT protection on it.
It does create a lot of logs. As a matter of fact, the company that I stole this from actually makes a special little box that would go out in the field to actually pull all the logs in. But then you have to get the logs off of that thing back into the core again. So this is very noisy.
Any of the ones... Hold on. Let me go back. Okay. Costs. Okay. We're here. Okay, good. Ballpark ranges. So let's use oil and gas, and then we're going to go into health care afterwards. Real quick, who here works in oil and gas? Health care? Oh, okay. Well, we'll go through both use cases.
So, ballpark ranges for handling data coming back over the wire. And this is just logs. Right? So I was able to do a bunch of research and find out how much all this stuff costs. So, typically, all together, after I did some calculations and went as low as I possibly could, it was somewhere around $400 a day to get just log data back over LEO satellite, which is the least expensive, store it, compress it and back it up.
Right. So it's right around $400 a day just in that. And that's just a single firewall, a couple of access points and some other gear that was actually on site. That's all it is. And it's still $400 a day. So not exactly inexpensive, right? Another one: hospitals. Now, I'm putting Richard up here because he's actually one of my best friends on the entire planet.
He also got me onto the Fox Hunt team at DEF CON. So if you've ever been to DEF CON and done the fox hunt, that's our crew. So come on out to DEF CON if you haven't before. And if you come to any of these kinds of conferences, you'll absolutely love that. It's a little bit more loosey-goosey, but it's a lot more fun sometimes.
So, anyways, Richard is now the CTO of Alberta Health Care after spending a whole long time at one of the other cyber companies I was part of. And he was complaining to me one day about this, and he says he gets 7.3 TB of just log data, just from his firewalls, from his 300 hospitals and 400 clinics throughout all of the state of Alberta.
Right. So because Alberta health care is all state driven, and he has all these hospitals and clinics to take care of, that is a lot of logs from just firewalls. So for just the firewalls alone, I think, after all was said and done, for an entire year, his grand total was $10.2 million just to store, forward, back up and protect the data that he got from the logs from his own firewalls.
That's $10 million just in that, not everything else, just $10 million in storing the logs. That is a lot of money, right? Now imagine somebody being able to save that kind of money and spend it on things that would, you know, better help out the entire rest of the team, or maybe just give the really, really exhausted SOC-lings and NOC-lings some bonuses.
Be kind of nice. Right? So, yeah. Answers. And yeah, it's AI, all right. So, apologies in advance. This is not selling a product. This is just a grassroots, how-can-I-do-it-myself type situation. So, everybody, this gets a little bit more technical after this, after we had a little bit of a talk about the budgets and budget analysis and why you would want to enter into a project like this.
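The budget figures quoted above can be put on a yearly and per-facility basis with a few lines of arithmetic. This is only a sketch: the dollar amounts and facility counts come from the talk, while `KEPT_PERCENT` (the share of log volume still forwarded after filtering) and the assumption that cost scales linearly with volume are illustrative guesses, not numbers from the speaker.

```python
# Figures quoted in the talk (USD).
SITE_COST_PER_DAY = 400              # one remote site over LEO satellite
HOSPITAL_COST_PER_YEAR = 10_200_000  # firewall logs only
FACILITIES = 300 + 400               # hospitals + clinics

# Assumption for illustration only: the share of log volume still forwarded
# after filtering at the edge. The talk does not give this number.
KEPT_PERCENT = 10


def annual_cost(per_day: float, days: int = 365) -> float:
    """Scale a daily cost to a yearly one."""
    return per_day * days


def savings(yearly_cost: float, kept_percent: int) -> float:
    """Yearly savings if cost scales linearly with the volume forwarded."""
    return yearly_cost * (100 - kept_percent) / 100


site_per_year = annual_cost(SITE_COST_PER_DAY)
per_facility = HOSPITAL_COST_PER_YEAR / FACILITIES

print(f"Remote site: ${site_per_year:,.0f} per year")
print(f"Hospital network: ${per_facility:,.0f} per facility per year")
print(f"Hypothetical savings at {KEPT_PERCENT}% kept: "
      f"${savings(HOSPITAL_COST_PER_YEAR, KEPT_PERCENT):,.0f} per year")
```

Changing `KEPT_PERCENT` is the quickest way to see how sensitive the business case is to how aggressively the edge filter trims the logs.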
Now, there are companies that are attempting to do this already, and I think they're calling them SAMs, security something mesh, security analytics mesh, that are starting to try to do something like this. Right? No idea what those companies are. Just heard about it, and I know that there's a bunch of startups kind of around some of this stuff.
But that being said, I've been working on this for a while, and we got it working. And we'll get into exactly what we're doing here in just a second. So has anybody ever done any Python notebooks? Okay. So those Jupyter notebooks and everything else like that. Has anybody ever come across TF-IDF? Okay.
So all it is is text classification. That's all we're doing in the whole thing. We're just tearing apart text documents and classifying the words inside them and placing them inside a bunch of buckets. That's really all it does. And then after you place them in a bunch of buckets, depending on the model you run over it, it will then make some selections and say, this is the stuff I like and this is the stuff I don't like. Right? Now,
I say "like"; that's very deontological talk, which is not what AI is capable of. Has anybody in here been in AI, or doing any research with AI, for more than two years? Okay. One thing that we need to know about AI and the problem with it, before we get into how you do this stuff,
is that every bit of AI we have now is only ontological thinking. Ontological thinking is just data sorting, pattern recognition, stuff like that. When you do that stuff really, really fast, it just seems like it thinks faster than you, right? So we're all aware that, you know, all AI does is make a bunch of errors, and the only thing it shows you is the thing it actually got right.
Right. So it's just making billions of errors in the background because it's able to calculate so quickly. And then it shows you the fun stuff, right? The stuff that you actually asked for. So it's the same way with what we're talking about, this TF-IDF. Now, this isn't really AI; this is actually machine learning. That's all this is. And it uses a very, well, it's called the pandas library.
And you can accelerate it with anything. Pandas library is just tensor. You can do it with a tiny TPU that's actually on USB. You don't have to have a very expensive processor on it. You can, you know, put a whole Nvidia thing on it if you really want to get all fancy and process billions and billions of gigabytes of data. If you really want to get down to it, there's a bunch of different ways you can do this.
But this is a very simplistic formula. It's been used for a long time and the bugs have been kicked out of it. So it's very reliable, actually. So it's just a bit of machine learning. Now, we're using that, and how I got the data back is I sent it into... I have a Kubernetes cluster somewhere. Now,
anybody done Kubernetes at home? Okay. Look up a company called portainer.io. Portainer, P-O-R-T-A-I-N-E-R. You can set up a Kubernetes cluster within about 30 minutes, maybe, if you use Portainer. Super duper easy. All you need is maybe some Proxmox, and then throw Portainer on top of it, and all of a sudden you have your own Kubernetes cluster within 30 minutes.
Super easy to do. They make it really, really simple. Yes, the founder's a friend of mine, but Neil's such a good guy that I suggest him everywhere. Plus, I don't do Kubernetes, because I hate that crap, and the only way I do it is where it's made really easy and simple to use. Easy and simple means elegant, right?
And it's both. So I got my nice little Kubernetes cluster that I put together really easily. I know that's usually, you know, oxymoronic. But, behind this thing, I just have ngrok. Now, ngrok, all that does is make sure that I can get data somewhere. It's a cloud service just like Cloudflare.
And you could probably buy the same thing with Cloudflare, but it takes a little bit more engineering. With ngrok, all I do is put down a credit card, put in my, you know, my IP addresses and my host names, and it's done. And now I can send data securely over either an API or something called MCP, which we're going to talk about in a second, from an AI that you put anywhere in the world.
Right. So all of a sudden this thing is secure. It's got an API gateway all built in. It's all fancy. And they do basically everything for you. It's the easy button to get data into somewhere central. So we're going to talk about how we take care of the logs where the logs are. So we don't have to pull them back over the wire.
We don't have to back them up, and we can sort them and make sense of them and only send the stuff that we want back over the wire, wherever we are. I mean satellite, DSL if you want, if someone still has that somewhere, and I'm sure they do. You know, LEO satellite, any of those things, Wi-Fi or anything else like that.
That stuff can come back very securely over that particular API gateway that you have in the cloud. And then you can send that gateway on a VM down to wherever you want it, in your home lab or anything else like that. Costs nothing. I pay maybe five bucks a month for that thing, right? Super duper useful. A nice little cloud service.
Cloudflare has it too. But then you have to do a whole bunch of stuff behind it, and anytime I have to do a whole bunch of something and it's not easy to do, I just find something else to do it with, because I don't have a whole lot of time, and I'm sure everybody else in here doesn't, to play around with their laboratories.
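The idea of handling logs where they are, and sending only what matters over the wire, can be sketched with a plain filter. The sample lines, the `level` field and the `KEEP_LEVELS` set below are hypothetical stand-ins; in the talk the keep-or-drop decision comes from a trained model rather than a fixed severity list.

```python
import shlex

# Hypothetical key=value log lines, loosely in the style of a firewall log.
RAW_LOGS = [
    'time=00:00:01 level="notice" action="accept" msg="traffic allowed"',
    'time=00:00:02 level="notice" action="accept" msg="traffic allowed"',
    'time=00:00:03 level="warning" action="deny" msg="login failed"',
    'time=00:00:04 level="notice" action="accept" msg="traffic allowed"',
    'time=00:00:05 level="critical" action="blocked" msg="ips signature matched"',
]

# Assumption: severities worth paying to ship back to the core.
KEEP_LEVELS = {"warning", "alert", "critical"}


def parse_kv(line: str) -> dict:
    """Split a key=value log line into a dict, honoring quoted values."""
    return dict(tok.split("=", 1) for tok in shlex.split(line) if "=" in tok)


def keep(line: str) -> bool:
    """Decide at the edge whether a line is worth forwarding."""
    return parse_kv(line).get("level") in KEEP_LEVELS


forwarded = [line for line in RAW_LOGS if keep(line)]
raw_bytes = sum(len(line.encode("utf-8")) for line in RAW_LOGS)
sent_bytes = sum(len(line.encode("utf-8")) for line in forwarded)

print(f"forwarded {len(forwarded)} of {len(RAW_LOGS)} lines")
print(f"bytes over the wire: {sent_bytes} instead of {raw_bytes}")
```

Everything that is dropped here never touches the satellite link, the central store or the backup, which is where the cost in the earlier examples comes from.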
So it's term frequency, inverse document frequency. Right? So all that is, is you have terms and you have the actual document itself that you're getting. So each one of these syslog, you know, things that you get in is considered a document. Has anybody ever played with InfluxDB and, like, any of that stuff at their home firewall? Influx and Prometheus and all that kind of stuff.
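To make "term frequency, inverse document frequency" concrete, here is a minimal sketch using only the Python standard library, with each log line treated as one document. It uses the textbook definitions tf(t, d) = (count of t in d) / (terms in d) and idf(t) = ln(N / df(t)); libraries typically apply smoothed variants, and the sample lines are made up.

```python
import math
import re
from collections import Counter

# Hypothetical log lines; each one is a "document".
DOCS = [
    "traffic accept srcip internal dstip external",
    "traffic accept srcip internal dstip external",
    "traffic deny srcip external dstip internal",
    "ips critical signature detected srcip external",
]


def tokenize(text: str) -> list:
    """Lowercase a line and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_.]+", text.lower())


def tf_idf(docs: list) -> list:
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        scores.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(DOCS)
for term, weight in sorted(scores[3].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

A term such as `srcip` that appears in every line gets a weight of zero, while a rare term such as `critical` scores high. That is the property that lets routine chatter fall away and unusual lines stand out.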
So that's also really easy to set up, if you want to start playing with Influx and Telegraf and the whole package that they have. It's all kind of Docker containers. So that's the reason why I pulled up a Kubernetes cluster: because I use that on the back end to kind of sort my data, sort and send it, using the TF-IDF stuff.
So, instead of using Influx, though, I use an Elastic database on the little bitty computer that I've got out in the remote place to do the document sorting. And then I've got one in, you know, the VM in my "data center," I'm saying, because it's actually in my laboratory at the house, with the backup that desperately and sorely needs a new battery in it.
But that's what I do, right? So it's a real simple lab to put together if anybody wanted to do it. And all you need is a few things with Docker containers, something to be able to control a bunch of Docker containers on the little bitty host that I put, you know, Proxmox on and made into a router, using a bunch of stuff.
And then on the back end of it, I put a couple VMs and then a couple Docker containers, and that's all on the little bitty box. It could be a little bitty NUC, or "nuke," whatever you want to call it, and it can sit out in the middle of nowhere, right? And then it takes this log data in, and any other kind of data logs like Windows authentication or anything else like that,
onto one little bitty central box, and then does the sorting and only sends up the stuff that I want over the wire, right out at the edge. The reason why I did this all over an API at the little bitty box side is because... see. Oh, that's right, I got it that way. I guess I'm not doing it right.
Could y'all advance the slide? Oh, there. I finally got it. Okay. Yeah. Okay. So we're gonna talk about how it does all this stuff, right? So we talked a little bit about the architecture, about how I have an Elastic database and a little bitty endpoint out there running on top of Proxmox, and almost everything's held on Docker: my Elastic database and my Telegraf.
Right. Those are two different Docker containers. And then in the central part I've got the same thing, but it's all hid behind ngrok. And so ngrok and Telegraf talk to each other, and then everything's super secure. I don't have to worry about data leaking. It's all over TLS and nobody has to get into any of my gear.
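The edge-to-central hop described above is handled by Telegraf and ngrok in the talk. As a stand-in, the sketch below shows the same shape with only the standard library: a JSON batch posted to an HTTPS endpoint with a bearer token. `GATEWAY_URL`, `API_TOKEN` and the payload layout are hypothetical placeholders, not the speaker's configuration.

```python
import json
import urllib.request

# Hypothetical values: substitute the public HTTPS endpoint and credential
# of whatever gateway fronts the central collector.
GATEWAY_URL = "https://logs.example.com/ingest"
API_TOKEN = "replace-me"


def build_request(events: list, url: str = GATEWAY_URL,
                  token: str = API_TOKEN) -> urllib.request.Request:
    """Package already-filtered log events as a JSON POST over HTTPS."""
    if not url.startswith("https://"):
        raise ValueError("refusing to send logs over a non-TLS URL")
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


req = build_request(["level=critical msg=ips signature matched"])
print(req.get_method(), req.full_url, len(req.data), "bytes")

# Sending is left commented out because the endpoint above does not exist:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```

The edge box only ever makes outbound connections in this arrangement, which matches the point that nobody has to reach into the gear at the remote site.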
Right. So once I've got that infrastructure set up and I've got my containers and I can actually deploy stuff in containers, what do I need to do now? Right. Well, this is exactly how it's going to work. I have to go create a training set. So for my training set I'm going to go get a bunch of logs from, let's say, FortiGates.
Right. So I'm going to take a bunch of the FortiGates, I'm gonna take a whole bunch of logs, and then I'm going to take all the logs I don't want to see and all the logs I do want to see, and I'm going to put them into two different training sets. One is the positive and the other one is the negative.
And then there's even going to be the objective: what I see as good. Like if I know something's a detection, like a critical, I'm going to put those in the third data set. Everybody understand so far? Good. All right. So those are just piles of logs. That's all they are. It's just a set of logs. I can then feed them into pandas.
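The three piles of logs can be turned into one labeled training set in a few lines. This sketch uses plain tuples rather than pandas so that it runs anywhere; a DataFrame with a text column and a label column would hold the same thing. The sample lines and the label names are invented for illustration.

```python
import random

# Hypothetical piles, sorted by hand as described in the talk.
NEGATIVE = [  # logs I don't want to see
    "traffic accept policy allow web browsing",
    "traffic accept policy allow dns query",
    "traffic accept policy allow ntp sync",
]
POSITIVE = [  # logs I do want to see
    "login failed admin invalid password",
    "login failed user locked out",
    "config changed by admin on firewall policy",
]
CRITICAL = [  # known detections
    "ips critical exploit signature blocked",
    "antivirus critical malware detected quarantined",
]


def build_training_set(piles: dict, seed: int = 0) -> list:
    """Flatten {label: [lines]} into shuffled (text, label) rows."""
    rows = [(text, label) for label, lines in piles.items() for text in lines]
    random.Random(seed).shuffle(rows)  # fixed seed keeps runs repeatable
    return rows


def split(rows: list, holdout: float = 0.25) -> tuple:
    """Keep a slice back so the model can be checked on unseen lines."""
    cut = int(len(rows) * (1 - holdout))
    return rows[:cut], rows[cut:]


rows = build_training_set(
    {"negative": NEGATIVE, "positive": POSITIVE, "critical": CRITICAL}
)
train_rows, held_out = split(rows)
print(len(train_rows), "training rows,", len(held_out), "held out")
```

Holding a few rows back is not mentioned in the talk, but it is a cheap way to check that the model has learned the piles rather than memorized them.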
So I grab the pandas library. All right. Great. Good, guys. Grabbing the pandas library, I create my Jupyter notebook, after I got my Jupyter Docker container up and running. And then I actually have the Python. I put it in there with the data sets, you know, with the names and tags that I actually want to have for those particular data sets.
Right? I pull all that stuff in just by running a couple of scripts, and then I have the actual Jupyter notebook itself, and all of a sudden I've got myself a model as a predictor on it. Go ahead Eric, what's up? The analogy for this is kind of like the more advanced version. So.
That's actually a really good point. So Eric just said, is this just like WAN acceleration from 20 or 30 years ago, which is what Silver Peak turned into, right? Before they were SD-WAN, they were this thing that would squash down a bunch of data. It is very similar to deduplication and squashing of the data before it gets sent over. Yeah, it's exactly right. And I think, Carlos, the abomination from Cisco was WAAS? I forget.
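The deduplication half of Eric's comparison can be sketched on its own: collapse lines that differ only in their timestamp into one line with a repeat count. The log format (timestamp first, then the message) and the sample lines are assumptions, and this shows only the squashing step, not the classification that the rest of the talk adds on top.

```python
from collections import Counter

# Hypothetical lines: a leading timestamp, then the message.
LINES = [
    "2024-01-01T00:00:01Z fw1 traffic accept",
    "2024-01-01T00:00:02Z fw1 traffic accept",
    "2024-01-01T00:00:03Z fw1 traffic accept",
    "2024-01-01T00:00:04Z fw1 login failed admin",
    "2024-01-01T00:00:05Z fw1 traffic accept",
]


def squash(lines: list) -> list:
    """Drop the timestamp and merge identical messages with a count."""
    counts = Counter(line.split(" ", 1)[1] for line in lines)
    # Counter keeps first-seen order, so output order follows the input.
    return [f"{msg} (x{n})" if n > 1 else msg for msg, n in counts.items()]


for line in squash(LINES):
    print(line)
```

Five lines become two here, and the ratio improves the noisier the source is.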
Anyways. Yes. Yeah. So that was a terrible thing. But anyways, it's very similar, right? What the process is, is a bit of deduplication, but it also does what people call agentic. Now, the thing we're doing at the edge can be considered an AI agent. The way we're doing it with this notebook, the only thing I'd have to do to make it a full agent is install, or bring up, a Docker container that has something called MCP on it.
And then I would send my results from my Telegraf into my MCP server, and the MCP server would then send the data up over. That makes it an agent. So all of a sudden I'm using agentic AI; I just have to create it myself. Now, does that mean anything? No. So I never did. Right? So the, you know, bull-shirt term is agentic AI.
And that's all I saw at Black Hat this year and RSA. So take it for what you will. You guys can do it too. It's not really all that hard. Creating an agent is as easy as what we're talking about. And there's reasons to do these agents, right? If you start playing around with these things with the logs, you will find what data you want to sort and only get that stuff out.
You save... if you did this yourself, even just to show it to somebody and say, okay, there's these products out here that are doing this thing right now, without having to deal with salespeople. If you showed this to anybody's boss, and how much money it would save: promotion time or bonus time, for serious. Like, that's this stuff.
If you saved the company $10 million, all of a sudden crazy things happen for you in the corporate world. I was never able to do that because I never had the attention span to finish anything. So I'm sure it's shocking for all the people that know me. So, so when I was talking about some of this stuff, this is essentially what happens.
Those documents are all the logs. It goes through our text processing. This is the kind of 30,000-foot view of what I just talked about. And then that tokenization, the bag of words that you actually get, those tokens are the things that create... and the tokens are the things that, if you have ever seen an AI decision matrix, look like a bunch of dots.
You know, the thing goes through a bunch of these decision points. And those are sometimes called vectors. Sometimes they're called tokens, whatever you want to call it. Right? So the vectors are actually the weighted connectors, but everybody uses the same terms to talk about the same thing. So there's little dots; those are your tokens. And the vectors, of course, are the weighted lines that go between them.
But those are the things that actually help make that decision process. That's the model you've created that you put all this data through. So now I can get out just the logs I want by doing a little bit of training on the thing I want to see. Now, you notice how I went very specific, down to something like a FortiGate, right?
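One way to picture "a little bit of training" on labeled logs is a nearest-centroid classifier over TF-IDF weights. The talk does not say which model was actually used, so treat this as a swapped-in technique chosen because it fits in a few lines of standard-library Python. The training lines and the `drop` / `forward` / `critical` labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled lines from a single device family.
TRAINING = [
    ("traffic accept policy allow web browsing", "drop"),
    ("traffic accept policy allow dns query", "drop"),
    ("traffic accept policy allow ntp sync", "drop"),
    ("login failed admin invalid password", "forward"),
    ("login failed user locked out", "forward"),
    ("config changed by admin on firewall policy", "forward"),
    ("ips critical exploit signature blocked", "critical"),
    ("antivirus critical malware detected quarantined", "critical"),
]


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def train(rows: list) -> tuple:
    """Learn smoothed IDF weights and one summed TF-IDF vector per label."""
    docs = [tokenize(text) for text, _ in rows]
    n_docs = len(docs)
    df = Counter(term for toks in docs for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    centroids = defaultdict(Counter)
    for toks, (_, label) in zip(docs, rows):
        for term, count in Counter(toks).items():
            centroids[label][term] += count * idf[term]
    return idf, dict(centroids)


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(text: str, idf: dict, centroids: dict) -> str:
    """Return the label whose centroid is closest to the new line."""
    counts = Counter(tokenize(text))
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


idf, centroids = train(TRAINING)
print(predict("ips critical exploit attempt blocked", idf, centroids))
print(predict("traffic accept policy allow web", idf, centroids))
```

Words the model has never seen are simply ignored in `predict`. That is one plausible reason a model trained on one vendor's vocabulary transfers poorly to another's, and why keeping it specific to a single device family matters.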
If you don't do something like that, this will not work. I have tried multiple times with different types of models I built myself, and used other people's models that are on Hugging Face. By the way, anybody been to Hugging Face yet? All right. Great. So if you know Hugging Face, you know that's the place to go get all this stuff for free.
So if you wanted to do it, you get the stuff, and I say free. Well, whatever. Anyways, you can, you know, download a bunch of this stuff for free and even get your own AI models and talk about it with everybody. Sort of like the Stack Exchange of AI. Okay. Thank you. So this is basically what it does, and you're getting stuff out of it like this.
This is exactly what I would get, you know, that amount of backup and data down. And then there's the stuff that you would then forward into the SIEM. I didn't talk about Splunk. I was doing this talk about transit of data and backup. And not to rag on Splunk, but it's just an easy one that is ridiculously expensive to afford, right?
And most of the SIEMs are, right? If I just got the logs that I needed into the SIEM, all of a sudden everybody's job gets a whole lot easier, right? And that's what this is all about. So if I do a bunch of pre-work, put some agents out at the edge where we have some of this sort of stuff, and you can do this as a pilot, right?
You know, in your businesses, just do one or two of these things and find the model that actually works for the logs that you want to pull in, then you'll be, you know, way down the path. So, this was supposed to be a quick one, because, you know, logs can not be scintillating. And I didn't have as many jokes as I thought I was going to have for this year.
But if anybody's got any questions on this or wants help putting together a lab, they can reach out to me. Come up to me afterwards and I'll give you my business card that, unfortunately, does something really nasty. I'm not kidding, but I've got an NFC reader card that I'll put on your phone. But definitely get my contact information and I can give you some actual instructions and some downloads.
I don't make this stuff public, because I don't really want it being open source, because I don't want somebody taking what I do and doing something really, really nasty or monetizing it. I like to give it to, you know, local community people and then have it be word of mouth only. So, I love my open source, but after finding some of the stuff that you do in open source out in the wide, wide world, and someone's made it into a product and, you know, then they start trying to license the stuff that you originally built,
you kind of lose the whole, you know, love for open source pretty quickly. So that's just where I've been at with it. So personal apologies for that. Now, any questions about any of this stuff or what we went through? We kind of burned through some of the documents, the Jupyter notebooks and the requirements here. I've got a list of the software, and if you do get my contact information, I will email you everything.
And maybe not a total step by step, but something pretty close. So yeah. And why did I start doing this stuff? Well, I've been through four cybersecurity exits so far. All successful. So I worked for Fortinet when it was pre-IPO. I got the IPO. That was wonderful. It's not life-changing money, but it was good.
And the company is great; I don't mind it. I went to go work for another company called CyberX, which is in the OT cyber world. And now it's called Defender for IoT. Great company. But, you know, Microsoft just wasn't my bag. Most recently I worked for a company called ZPE. They just got bought by a company called Legrand.
And I've done consulting for a couple other ones, and did it for a little bit of equity as well. At no time have I ever been able to stop working or totally put my kids through college, even after doing a significant amount of work for those particular startups. So I decided to start one myself.
So if anybody wants to talk about, you know, what we're up to and what we're doing, come on over and talk. Because, you know, I just like helping people out inside the community, especially the Houston one. So thank you very much for coming to my talk today. And thanks for putting up with me for about 30 minutes.
Thanks, everybody.