Enterprise-Level Data Science with a Skeleton Crew

By Tim Stacey, Ph.D. / Adlumin, Inc.

“Data science is a team sport.” Since at least 2013, this axiom has been repeated to make the point that there is no unicorn data scientist, no single person who can do it all. Many companies have followed this wisdom, fielding massive data science operations.

But more often than not, a big data science team isn’t an option. You’re a couple of techy people trying to make waves in a bigger organization, or a small shop that depends on analytics as a part of the product.

My company falls into the second camp—at Adlumin, we’re a small team that has a big enterprise-level problem: cybersecurity. We use anomaly detection to monitor user behavior, looking for malicious activity on a network. Because we need to catch intrusions quickly, we perform streaming analytics on the information we receive.

Two things allow us to succeed: building on the cloud and thoroughly testing our analytics. The cloud isn’t specifically for small teams, but it helps a small team compete with, and even outpace, bigger competitors. Testing is our failsafe. By implementing useful tests on our analytics, we can be confident that the models will perform when they’re released.

Below are three principles that I’ve distilled into part of a playbook for doing data science at scale with a small team.

1. The cloud is your friend.

One issue in data science is the disconnect between development and deployment. In a big company, the data scientists often have the luxury of creating something that won’t scale and then punting deployment to the engineers. Not so on our skeleton crew.

Enter the world of the cloud. By moving most of your DevOps to a cloud-based platform, you can work on getting the analytics stood up without handling the tricky details of database management or orchestration yourself.

For streaming analytics, two great options exist: serverless and container-based.

Serverless analytics involve spinning up a process when data comes in, doing some number crunching, and then disappearing. This can be a cost-saving measure because no server has to be kept running just to wait for new data. However, the analytics must be fairly lightweight: most serverless offerings will time out long before you can load up a big model.

Containers are more permanent. We still can have live, streaming analytics, but now a container will load the model and keep it ready to receive data all the time. This can be a useful configuration if the model is going to be large, the library requirements many, or the uptime constant. This is also a preferred method if you have a handful of workhorse models for all of your analytic needs.

At Adlumin, we aren’t drawing on heavy libraries and we need to evaluate many (>5000) models quickly, so a modification of the serverless option forms the basis of our anomaly detection.

Our method starts by building a baseline model for each of our users. This job runs weekly. We probe a large data store for user behavior data, build baselines (which are small weight matrices), and then store them in a fast NoSQL database.
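As an illustration only, a weekly baseline build might be sketched as follows. The details are assumptions, not Adlumin's actual method: the event vocabulary, the four time-of-day buckets, the frequency-matrix form of the baseline, and the plain dict standing in for the NoSQL database are all hypothetical.

```python
from collections import defaultdict

EVENT_TYPES = ["login", "file_access", "admin_action"]  # hypothetical event vocabulary
HOUR_BUCKETS = 4                                         # hypothetical: four 6-hour buckets


def build_baseline(events):
    """Return a small matrix of event frequencies, normalized to sum to 1.

    `events` is an iterable of (event_type, hour_of_day) tuples for one user.
    Rows are time-of-day buckets, columns are event types.
    """
    counts = [[0.0] * len(EVENT_TYPES) for _ in range(HOUR_BUCKETS)]
    for event_type, hour in events:
        bucket = hour * HOUR_BUCKETS // 24
        counts[bucket][EVENT_TYPES.index(event_type)] += 1.0
    total = sum(sum(row) for row in counts)
    if total == 0:
        return counts  # no activity observed: leave the matrix at zero
    return [[c / total for c in row] for row in counts]


def build_all_baselines(history, store):
    """Group raw history by user and write one baseline per user into `store`.

    `history` is an iterable of (user, event_type, hour_of_day) tuples.
    `store` is any dict-like object; here it stands in for a key-value database.
    """
    per_user = defaultdict(list)
    for user, event_type, hour in history:
        per_user[user].append((event_type, hour))
    for user, events in per_user.items():
        store[user] = build_baseline(events)
    return store
```

Keying the store by user keeps each lookup to a single read, which matters when thousands of baselines have to be fetched on demand.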

To process live data, we collect user data in sessions, which are event streams broken into chunks. Once a session appears to be complete, we spin up a serverless process to read the session, query for the appropriate baseline, and evaluate the two together. A result gets passed to another database and the process dies, ready for the next session.
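The read-lookup-evaluate-store flow above could look roughly like the sketch below. It is a stand-in under stated assumptions: the handler name, the session and result shapes, the dict and list used in place of real databases, and the "mean surprise" score are all hypothetical, and the real scoring method is not described here.

```python
import math

EVENT_TYPES = ["login", "file_access", "admin_action"]  # hypothetical event vocabulary
HOUR_BUCKETS = 4                                         # hypothetical: four 6-hour buckets
SMOOTHING = 1e-3  # hypothetical floor so never-seen behavior gets a finite score


def score_session(events, baseline):
    """Mean surprise (negative log probability) of a session under a baseline.

    `events` is a list of (event_type, hour_of_day) tuples; `baseline` is a
    matrix indexed as baseline[time_bucket][event_type_index].
    """
    if not events:
        return 0.0
    total = 0.0
    for event_type, hour in events:
        bucket = hour * HOUR_BUCKETS // 24
        p = baseline[bucket][EVENT_TYPES.index(event_type)]
        total += -math.log(max(p, SMOOTHING))
    return total / len(events)


def handle_session(session, baseline_store, result_store):
    """Entry point a serverless runtime might invoke once per completed session."""
    user = session["user"]
    baseline = baseline_store.get(user)
    if baseline is None:
        # No baseline yet (for example, a brand-new user): record that fact
        # instead of guessing at a score.
        result = {"user": user, "score": None, "status": "no_baseline"}
    else:
        result = {
            "user": user,
            "score": score_session(session["events"], baseline),
            "status": "scored",
        }
    result_store.append(result)  # stand-in for a write to the results database
    return result
```

Because the handler keeps no state between calls, each invocation can start cold, do its one lookup and one write, and exit.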

2. Get something that works, then test it.

Sometimes testing seems more like a necessary evil. The best test might be the biggest hurdle when you’re on a tight deployment timeline.

But you need to find a way to evaluate whether your analytics are returning sensible results. Again, there are options:

  1. Real testing: Someone has entrusted you with a cherished “golden” data set. This data contains ground truth labels, so you can perform classic train-test splits, evaluate metrics, and do other rigorous testing.
  2. Natural testing: Instead of being handed a data set, you can construct a ground truth from information external to your dataset. Join multiple data sets, manipulate metadata, or come up with another way to create a target.
  3. Artificial testing: Make a data set! This is a great addition to a testing suite, even if you have either the first or second option. You can create a small data set that can be evaluated every time you push new code.
  4. Feel testing: Run your model on live data and observe the output. Does the output meet your or the users’ expectations? You want to know if you have a really noisy model, a quiet model, or something in between.

At Adlumin, we have some data that reflects ground truth. For instance, saved penetration testing data reflects what a single type of attack might look like. This is a great opportunity to test out our models, but attacks can take a number of forms, which limits the utility of this data.

Additionally, we know a little bit about the users we monitor. For instance, many companies create service accounts to perform the same tasks, day in and day out. Because their behavior is so regular, these accounts should rarely look anomalous. We test whether we routinely flag them; if we do, the models need to be heavily reworked.
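A check along these lines might be sketched as below. The result format, the alert threshold, and the function name are hypothetical, chosen only to show the shape of such a "natural" test.

```python
def service_account_flag_rate(results, service_accounts, threshold):
    """Fraction of service-account sessions whose score exceeds the alert threshold.

    `results` is a list of dicts with "user" and "score" keys (a hypothetical
    format); `service_accounts` is a set of known service-account names.
    """
    scores = [r["score"] for r in results if r["user"] in service_accounts]
    if not scores:
        return 0.0  # nothing to judge; treat as no flags
    return sum(score > threshold for score in scores) / len(scores)
```

A test can then assert that this rate stays under some tolerance. Picking that tolerance is a judgment call about how much noise users will accept.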

Finally, we created our own data set, complete with data that reflects both normal and anomalous behavior. We integrated this into a test before model deployment.
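A minimal sketch of an artificial test is shown below. Everything in it is an assumption for illustration: the event names, the frequency-based stand-in detector, the probability floor, and the threshold are not Adlumin's actual model or data. The seed is fixed so the test gives the same answer on every push.

```python
import math
import random
from collections import Counter

FLOOR = 1e-3  # hypothetical probability floor for never-seen events


def make_synthetic_sessions(seed=0, n_normal=50, n_anomalous=5, length=20):
    """Return (normal, anomalous): two lists of sessions, each a list of event names."""
    rng = random.Random(seed)  # fixed seed keeps the data set repeatable
    normal = [
        [rng.choice(["login", "file_access"]) for _ in range(length)]
        for _ in range(n_normal)
    ]
    anomalous = []
    for _ in range(n_anomalous):
        # Half of each anomalous session is an event normal users never produce.
        session = ["admin_action"] * (length // 2)
        session += [
            rng.choice(["login", "file_access"]) for _ in range(length - length // 2)
        ]
        rng.shuffle(session)
        anomalous.append(session)
    return normal, anomalous


def fit_baseline(sessions):
    """Stand-in model: relative frequency of each event type."""
    counts = Counter(event for session in sessions for event in session)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def score(session, baseline):
    """Mean negative log probability of the session's events under the baseline."""
    surprise = (-math.log(max(baseline.get(event, 0.0), FLOOR)) for event in session)
    return sum(surprise) / len(session)


def test_detector_separates_synthetic_anomalies():
    normal, anomalous = make_synthetic_sessions()
    baseline = fit_baseline(normal[:40])  # fit on part of the normal data
    threshold = 2.0                       # hypothetical alert threshold
    # Held-out normal sessions should stay quiet; anomalous ones should fire.
    assert all(score(s, baseline) < threshold for s in normal[40:])
    assert all(score(s, baseline) > threshold for s in anomalous)
```

Written as a plain `test_` function, this can be picked up by a test runner such as pytest and run alongside the rest of the suite.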

3. Orchestrate a lot of things at once.

One additional item that makes this all work is orchestration. Orchestration supports our automated tasks by putting code in the right places and managing the tasks themselves.

We use a continuous integration system that, whenever we push new code, puts every script in the right place (e.g., updating the script for serverless processes and pushing new code to the baseline generation server). We don’t have to scp anything onto a server; the push to our code repository covers everything.

In addition, tests will automatically fire when code gets pushed. If the tests fail, the code won’t be updated and erroneous stuff won’t get deployed.
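The gating logic can be illustrated with a short sketch. A real continuous integration system would normally express this in its own configuration; the function names, the use of pytest as the test runner, and the deploy steps here are hypothetical.

```python
import subprocess
import sys

# Assumes pytest is the test runner; swap in whatever command runs your tests.
DEFAULT_TEST_CMD = [sys.executable, "-m", "pytest", "-q"]


def run_tests(test_cmd):
    """Run the test command and report whether it exited with status 0."""
    completed = subprocess.run(test_cmd)
    return completed.returncode == 0


def push_pipeline(test_cmd, deploy_steps):
    """Run the tests first; execute the deploy steps only if they pass."""
    if not run_tests(test_cmd):
        return "blocked"  # failing code never reaches the deploy steps
    for step in deploy_steps:
        step()
    return "deployed"
```

A call such as `push_pipeline(DEFAULT_TEST_CMD, [update_serverless_script, update_baseline_server])`, where both step functions are placeholders you would write yourself, runs every deploy step in order or none at all.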

Updating the whole operation piecemeal would be tedious and error-prone. There are too many moving parts! Orchestration also allows us to move quickly. As soon as we develop new code, it can be run against tests and put into the right place without having to consider any additional steps. This frees up time and also headspace formerly preoccupied with deployment details.

There are many other aspects to making streaming analytics work in a small team, but these are three important ones. Doing enterprise-level data science with a skeleton crew can be challenging, but it is rewarding and fun!

Originally published on Oracle’s AI and Data Science Blog / August 23, 2018

Dr. Tim Stacey is the Director of Data Science for Adlumin Inc., a cybersecurity software firm based in Washington, DC. His work primarily focuses on user behavior analytics and his experience includes designing analytics for Caterpillar, the RAND Corporation, and the International Monetary Fund. He holds a PhD from the University of Wisconsin Madison in computational chemistry.