Ending the Guessing Game: When Data Science and Visualization Come Together
Webinar Transcript – Grant Case, Analytics and Data Science Architect – Dataiku
Good, green. Perfect. Love it. Hi, my name is Grant Case. I work for Dataiku as an analytics architect, working specifically with data scientists, analysts, and Tableau customers. Over the last six years I’ve worked with 200-plus customers across the country, all in and around analytics, machine learning, and data visualization. I’m actually an old OG when it comes to Tableau: back in Tableau 4, I was the second purchase of Tableau inside of Citigroup, about a decade ago. So I love Tableau. Always have. This talk is all about ending the guessing game: how can we take visualization and data science and bring them together to come to better solutions? So with that, what I want to talk about today is the use case. All props to InterWorks, who, if you were here at TC 2017, came up with a fantastic use case: will your flight be delayed if you are leaving Las Vegas?
Now how did they do that? They made a fantastic interactive dashboard in Tableau. But what’s on the back side of that was Dataiku. So what we’re going to talk about today is how Dataiku created the predictive analytics, and also the inference, to understand how weather impacts your flights. In addition to working for Dataiku, I’m an adjunct professor at Columbia University in a master’s in analytics program. So I’m going to do a quick primer on machine learning and then, of course, demo time. Let’s actually look at what’s going on and what’s happening. When I talk to my students, and when I talk to new customers who are making their way down the machine learning path, I always talk about prediction versus inference. The way to think about it: prediction is, am I going to be accurate? So in the case of weather, is my flight delayed? I can start adding variables about my weather data, about what’s happening at the hubs around the US, about flight numbers, time of day, carriers. All of these variables we can put into the function, or the model, to start to understand if a flight will be delayed.
Now prediction in this case is all about how we predict whether that flight is going to be delayed. We only really care whether we were correct or not. And that is okay. So typically when we’re talking about prediction, it’s: is somebody going to buy something from us? Or if I send a mailer out to someone, are they going to return it to us? The techniques used to estimate that, the functions and variables, are all about minimizing that error. The bottom one, and probably all of you who have worked with Tableau have come into this concept, is inference. What is inference? Inference is trying to understand why something happened. So as you’re going into a Tableau dashboard, you’re starting to drag and drop onto the canvas. You’re trying to understand, okay, what variables, what correlations are actually impacting my cost of goods sold, my revenues, my ROI, whatever that might be.
Right? So the key takeaway is: with prediction, I’m always trying to find out whether I’m correct or not, right? Inference is really, and I like this, I can actually explain what that prediction was and how I got to it, to a board member, right? And those are two key aspects. Now, the second discussion I want to have is bias versus variance. As you’re building models in machine learning, as you’re trying to understand your data science problems, you start out with what’s referred to as high bias. High bias is all about, hey, I throw a variable in and I think a result’s going to occur. Your model sucks if you have a lot of bias: it’s too simple. A lot of this ends up being conventional wisdom. Who deals with conventional wisdom all day, every day, right?
Everybody’s hand should be up, right? So what do we do well with Tableau and the visualizations? We’re trying to understand the correlations; we’re dragging things onto the dashboard. We’re trying to get to some sort of understanding of which variables have an impact, right? But the human brain can really only take about three variables at any one time and try to correlate them. The wonderful thing about algorithms in general is they can take hundreds, thousands, tens of thousands of variables. The most complex model I know of was built by a former IBM distinguished engineer. She had a dataset of 10 million variables. It was ones and zeros, one or zero for whether you had actually visited a website or not, and the whole concept was that they had to decide within eight tenths of a second whether or not they would buy a digital ad, right?
So it had to be very, very fast. When they’re using all of those variables, the algorithm can actually figure those out. Now as you train that model, you get better and better and better, right? But what happens when you decide, okay, I’m going to take that model and put it on something else? Here’s some new data I don’t know the answer to. That’s when we start to get what we refer to as validation error. And you can also start to overfit. Meaning: if Grant was here at this time and in this area, well, Grant’s going to be right here, right? Well, I’m not going to be right here come Friday; I’m going to be on a plane back to New York. So that’s where we start to overfit in terms of variables. Your model sucks here too. That means it’s too complex; there are too many variables, maybe.
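To make the gap between training error and validation error concrete, here is a minimal Python sketch. It is an illustration only, not anything shown in the demo: the toy data, the "memorizer" model, and the cutoff rule are all assumptions. The memorizer is perfect on the rows it has seen, noise included, and then does worse on new flights than a much simpler rule.

```python
# Hypothetical toy data: (departure hour, 1 if delayed else 0).
# The underlying pattern is "delayed from 15:00 onward"; the rows at
# hour 8 and hour 17 are deliberate noise.
TRAIN = [(6, 0), (7, 0), (8, 1), (9, 0), (10, 0), (12, 0), (14, 0),
         (15, 1), (16, 1), (17, 0), (18, 1), (20, 1), (21, 1)]
# New flights the models have not seen, labeled by the underlying pattern.
VALIDATION = [(5, 0), (8.2, 0), (11, 0), (13, 0),
              (16.8, 1), (17.4, 1), (19, 1), (22, 1)]


def memorizer(hour):
    """Overly complex model: copy the label of the closest training row."""
    _, label = min(TRAIN, key=lambda row: abs(row[0] - hour))
    return label


def fit_threshold(rows):
    """Simple model: pick the cutoff hour with the fewest training errors."""
    def errors(cutoff):
        return sum((hour >= cutoff) != bool(label) for hour, label in rows)
    return min((hour for hour, _ in rows), key=errors)


CUTOFF = fit_threshold(TRAIN)


def simple_rule(hour):
    """Predict a delay for any departure at or after the fitted cutoff."""
    return int(hour >= CUTOFF)


def error_rate(predict, rows):
    """Fraction of rows where the prediction disagrees with the label."""
    return sum(predict(hour) != label for hour, label in rows) / len(rows)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, error_rate(model, TRAIN), error_rate(model, VALIDATION))
```

The memorizer has zero training error because it has learned the noise, and that is exactly what hurts it on the validation rows.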
And that’s ultimately not good for you either. That’s where you might end up in, like, Excel hell, right? You’re throwing everything in there. Now I bring all of this up because ultimately we at Dataiku, with InterWorks, are trying to help you get past these elements. We want to make it easy for you to understand and consume your data, just like Tableau does. We have wonderful ways of pushing that data into Tableau. So let’s take a look at this use case and try to understand how we took this concept of creating a great interactive dashboard and then pushed that out to the Tableau community. So with that, let’s go into demo time.
So the first thing: if you’ve been here before, and if you were at TC17, you saw this dashboard by InterWorks, “Will your flight be on time from TC?” Ultimately we were collecting commercial flight data, analyzing that data, cleaning it up, and then creating a model. So how did InterWorks do that? They basically have their own application and their own dashboards and, excuse me, their own JavaScript to run all of this. You select your airline, you select your destination and your departure time, and ultimately what happens, right? What happens is a great-looking Tableau dashboard. So everything that’s going on in this Tableau dashboard... nope. Oh man. Yeah, that’s fun. I always love fun. So there we go. You’ve got to be kidding me. So we’re going to do this live.
So that Tableau dashboard, while we’re waiting on that, is ultimately trying to ferret out the variables and their importance, what’s going on and what’s happening. So what happens, right? You’re submitting your inputs and you’re ultimately getting some variables back. Now this dashboard is basically looking at TC18, right? So for TC18, what weather’s coming in? But when we’re building models, we’re ultimately trying to do that for both sides of the house. We’re trying to understand both the inference, why something happened, and the prediction, can it happen again. What you’re seeing on the screen is Dataiku. We are a self-service analytics and machine learning platform for data science. We work inside of projects inside of Dataiku, and part of that starts to build up the analysis. So what were we trying to do as a part of this analysis?
Well, we were going to source data from different locations, we were going to transform those columns, transform the airport IDs, massage the data in many of the same ways you do, and build a list of the hub airports. And why, right? Anybody want to take a guess at why we would build a list of hub airports? Well, guess what happens when we’re having flights: if those hub airports start having issues, even if we’re here in Louisiana and we’re not having problems and there’s no rain, if there is a massive thunderstorm over Atlanta, guess what happens? Delays go out from the hubs. This is actually part of what you do as a Tableau analyst. You’re figuring out, okay, I can start to look at this variable. Okay, if Charlotte has a problem, that means American flights are going to have issues.
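The hub idea can be sketched as a small feature-engineering step in Python. This is a minimal illustration, not the InterWorks or Dataiku implementation: the carrier-to-hub mapping is a simplified assumption, and the function and column names (`hub_disruption_flag`, `hub_disrupted`) are hypothetical.

```python
# Simplified, illustrative carrier-to-hub mapping; real carriers have more hubs.
HUBS_BY_CARRIER = {
    "AA": {"CLT", "DFW"},
    "DL": {"ATL"},
    "UA": {"DEN", "ORD"},
}


def hub_disruption_flag(carrier, airports_with_storms):
    """Return 1 if any hub of the carrier currently has a storm, else 0."""
    hubs = HUBS_BY_CARRIER.get(carrier, set())
    return int(bool(hubs & set(airports_with_storms)))


def add_hub_feature(flights, airports_with_storms):
    """Copy each flight record and add a 'hub_disrupted' column."""
    return [
        {**flight,
         "hub_disrupted": hub_disruption_flag(flight["carrier"],
                                              airports_with_storms)}
        for flight in flights
    ]


flights = [
    {"carrier": "AA", "origin": "MSY", "dest": "LAX"},
    {"carrier": "DL", "origin": "MSY", "dest": "JFK"},
]
# Clear skies in New Orleans (MSY), but a thunderstorm over Charlotte (CLT).
featured = add_hub_feature(flights, {"CLT"})
for row in featured:
    print(row)
```

The new column gives the model a way to see trouble at a hub even when the local weather at the departure airport is fine.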
So therefore we should probably look at that. Something that might impact an American flight may not impact my flight, right? Because maybe I’m flying Delta through Denver. And then ultimately we’re going to build that model. So how do we build that model? Well, we’re taking a number of different variables of importance, we’re bringing them together, and ultimately we’re going to push that result out. Those results can go to web APIs, like a REST call, which is what InterWorks is doing: every time you make a call, they’re pulling that data out and rescoring that model. We could also push out to batch; we could push out to Tableau. So whatever that production lifecycle is for you, we’re there to help be a part of it. Think of us as more upstream, upstream of your dashboards and of your analysis. So how do we do that? What’s going on here as part of this process?
So what we have in the middle of this screen is what we refer to as a flow. It’s a directed acyclic graph, and it basically is an instruction set. It’s just like when you start to build a project and you say, okay, I’m going to copy this file down, I’m going to process it one way, I’m going to run my dashboard. Maybe you’re good at scripting. This is in fact what we’re doing in a project sense. Also, part of projects is trying to collaborate. So how do we collaborate? Well, we start to have discussions amongst our team, so we try to build that into the tool as well. As we start to analyze our dataset, as we’re pushing things back and forth with Tableau, we’re trying to understand what happens. So just as you would work in data, we work in data as well.
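The idea of a flow as a directed acyclic graph that doubles as an instruction set can be sketched with Python's standard library. This does not use Dataiku's API or reproduce the actual project; the step names are hypothetical and only loosely follow the steps described in the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps that must finish before it can run.
# Step names are hypothetical, loosely following the talk.
FLOW = {
    "clean_airport_ids": {"load_flights"},
    "build_hub_list": {"clean_airport_ids"},
    "join_weather": {"clean_airport_ids", "load_weather", "build_hub_list"},
    "train_model": {"join_weather"},
    "score_flights": {"train_model"},
}

# static_order() yields every step after all of its prerequisites and
# raises graphlib.CycleError if the graph contains a cycle.
run_order = list(TopologicalSorter(FLOW).static_order())
print(run_order)
```

Because the graph has no cycles, there is always at least one valid order in which to run the steps, which is what lets a flow act as a repeatable set of instructions.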
So what makes your job easier as an analyst is actually going out, looking at the data itself, and understanding what’s happening. So what do we do in Tableau, and what are we doing in Dataiku? We make the data easy for you to understand. We start to sample it, we start to understand it. We ultimately give you the ability to see your data as you transform it. As you visualize that data, it flows along in a pipeline. Machine learning, data science, is ultimately a pipeline of what you’re doing. And you as an analyst, you as a data scientist: we believe everybody’s got to be on the field, so we can no longer work in silos. So how do we do that? We bring everybody together. And how do we do that from a machine learning perspective?
Well, we start to build in components like random forests, decision trees, different machine learning algorithms inside the tool. So we can take a normal set of data, a set of variables, and start to understand it from an interpretation and a variable importance perspective. Remember I was telling you earlier about inference versus prediction. So how do we start to understand what our datasets are doing? We try to understand variable importance.
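One common way to measure variable importance is permutation importance: scramble one column, rerun the model, and see how much accuracy drops. The Python sketch below illustrates that idea only. It is not the importance score a random forest or the tool itself reports, and the data, the stand-in `model`, and the feature names are all assumptions. A fixed rotation is used instead of a random shuffle so the result is repeatable.

```python
FEATURES = ["hub_storm", "evening", "gate_digit"]
# Hypothetical rows: (hub_storm, evening, gate_digit) and whether delayed.
ROWS = [(1, 0, 3), (0, 0, 5), (0, 1, 2), (0, 0, 7), (1, 1, 1), (0, 0, 4)]
DELAYED = [1, 0, 1, 0, 1, 0]


def model(row):
    """Stand-in for a trained model: delayed if hub storm or evening flight."""
    hub_storm, evening, _gate_digit = row
    return int(hub_storm or evening)


def accuracy(rows):
    """Fraction of rows where the model output matches the label."""
    return sum(model(r) == y for r, y in zip(rows, DELAYED)) / len(rows)


def scrambled(rows, col):
    """Rotate one column down by a row, breaking its link to the labels.
    (A real implementation shuffles randomly and averages several runs.)"""
    values = [r[col] for r in rows]
    values = values[-1:] + values[:-1]
    return [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, values)]


baseline = accuracy(ROWS)
importance = {
    name: baseline - accuracy(scrambled(ROWS, col))
    for col, name in enumerate(FEATURES)
}
print(importance)
```

Scrambling a column the model never uses (`gate_digit` here) leaves accuracy unchanged, so its importance is zero, while scrambling a column the model relies on costs accuracy.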
So a part of the dashboard that InterWorks had created was ultimately trying to elicit what that variable importance is. Well, how did it get there? It gets there through the algorithm itself, whether we’re talking about a supervised algorithm like random forest or decision trees or XGBoost, or we’re looking at unsupervised. Any folks in marketing do a little segmentation work, right? Yeah. So how do you segment? You’re probably figuring out, okay, you’re creating a persona.
Jill is between the ages of 25 and 34, she’s urban upscale, right? Part of how you can do that is to take unsupervised machine learning such as two-step or k-means clustering. Ultimately, why is it important? Why do we care? Ultimately that clustering capability starts to elicit those important variables. What’s important about Jill, what’s important about what she does and who she is, and that’s the algorithms. You’re letting your data tell the story, just like you let your data tell the story when it comes to Tableau. The algorithms help you tell that story in the best way possible. And the algorithms themselves can actually tell you about the variables that might impact you. So back to this particular flight delay, right? What’s important about flight delays? Well, do you believe flight number would be important for a flight delay?
Probably yes. Right? Because what happens a lot of times, and I say this as a former United employee, is that you will bake in ground time, right? You already know that flight’s going to be delayed. You as a customer want that flight to leave at 11 o’clock, but I know for a fact you’re not going to get to JFK before 3:30. So what do I do? I already bake it in, right? That’s why flight number would be important. Because I see Delta, I see that flight number, and I know exactly what’s going to happen.
Also, part of understanding what you’re doing is trying to understand whether you were right or not, back to that predictive side. Right? What’s important sometimes when we’re talking about machine learning is the prediction component being, hey, is it better for me to be correct, or is it better for me to miss?
And that may be important, right? So if I’m spending $200 or $300 to try to acquire one customer, that’s a lot of money. I would probably always want to be absolutely sure, because I have a very finite budget. That’s what we refer to as a true positive, right? But what if, what if I have a $3 million... Any folks in manufacturing? What happens if a machine on the line breaks down, right? That’s probably a couple million bucks. That line goes down; you’ve got to send everybody home, right? That’s when you say, you know what, if I miss, that’s okay, because I’m going to send a tech out and I’m going to spend about $200, because it could be $200,000 if I have to shut down that line. So that becomes part of this process of machine learning and visualization.
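The trade-off between a cheap false alarm and an expensive miss comes down to simple arithmetic. Here is a small Python sketch using the $200 and $200,000 figures from the manufacturing example. The function names are hypothetical, and the model is deliberately simplified: it ignores the cost of visits that correctly catch a failure.

```python
def break_even_probability(cost_of_acting, cost_of_missing):
    """Failure probability above which acting is cheaper than waiting."""
    return cost_of_acting / cost_of_missing


def should_send_tech(p_failure, cost_of_visit=200, cost_of_shutdown=200_000):
    """Act when the expected shutdown cost exceeds the cost of the visit."""
    return p_failure * cost_of_shutdown > cost_of_visit


def total_cost(false_positives, false_negatives,
               cost_of_visit=200, cost_of_shutdown=200_000):
    """Price the two error cells of a confusion matrix."""
    return false_positives * cost_of_visit + false_negatives * cost_of_shutdown


print(break_even_probability(200, 200_000))   # threshold on predicted risk
print(should_send_tech(0.01))                 # a 1% predicted risk of failure
print(total_cost(false_positives=50, false_negatives=1))
```

With these numbers, a tech visit pays for itself once the predicted risk of a breakdown is above roughly one in a thousand, which is why a model that "misses" often with false alarms can still be the right business choice.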
And this is also about the confusion matrix. As we start to work inside of Tableau, you can start to show people: hey, what’s the cost? What’s the cost of something occurring? Right? So we try to do that. Now, ultimately, the big answer for any model, if we’re talking about prediction, is whether or not you were actually correct. How do we do that? What you’re seeing on the screen here is a lift chart. This is something you can actually do yourself. Anybody done a lift chart in Tableau? Perfect. Awesome. All a lift chart is trying to do is show how much better my prediction is versus randomness, right? So the straight line from left to right is really just, hey, if I took random guesses, what would that answer be? The top one is the wizard, the wizard being if I’d made perfect guesses. Ultimately I want to be above that random line, right?
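The three lines on a chart like this can be computed in a few lines of Python. The sketch below builds a cumulative gains curve, one common form of lift chart, along with the random and "wizard" reference lines. The scores and labels are made-up illustration data, and the function names are hypothetical.

```python
def cumulative_gains(scores, labels):
    """Share of all positives captured after taking the top-k scored rows."""
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    total = sum(ranked)
    captured, curve = 0, []
    for label in ranked:
        captured += label
        curve.append(captured / total)
    return curve


def random_line(n):
    """Random guessing captures positives in proportion to rows taken."""
    return [k / n for k in range(1, n + 1)]


def wizard_line(labels):
    """A perfect model ranks every positive ahead of every negative."""
    positives = sum(labels)
    return [min(k, positives) / positives for k in range(1, len(labels) + 1)]


# Hypothetical model scores and actual outcomes (1 = delayed).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

model_line = cumulative_gains(scores, labels)
for k, (m, r, w) in enumerate(
        zip(model_line, random_line(len(labels)), wizard_line(labels)), 1):
    print(k, m, r, w)
```

Plotting the three columns against k gives the picture described above: the model's curve should sit between the random diagonal and the wizard.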
What happens if I’m below that line? You’re worse than random. That means it’s awful. Those are the folks you want to go to Las Vegas with: they bet one way on the game, you bet the other way, right? Because ultimately that model is telling me the opposite is the better answer overall. So again, when we start to talk to folks about what a model is, you have to be able to understand it and talk to your clients, talk to your business end users, talk to your CFO, your CIO, your CTO. Because ultimately, while these models may appear to be very complex in how they’re working, all of this is going to be part of what we do. We’re all going to be doing this in the next two to three years. I’ve seen it happen. I was there at the very beginning on the data discovery side.
I looked at Tableau and I said, you know what, this is life changing. Anybody? Has working with Tableau been life changing for you? It has for me. I basically spent my entire career at Citigroup building in Tableau and helping work on things. So ultimately, as we start to understand how models work, how they start to work with our data visualization, it becomes so very, very important to understand both sides of this house.
The visualization side meaning, hey, can we make it interpretable? Can we build things like Corinthia charts? Can we build things like ROC curves? Ultimately, while they may seem very complex, if we start to think about things like the one over here on the right-hand side and how we discuss them, it’s pretty simple. Hey, so long as this is going up to the top left, we’re in good shape.
The more we go up into the top left, the better shape we’re in, right? That’s what visualization does. Visualization can help us find the variables of importance in a 10-million-variable project or dataset, and you’re probably not going to want to try to do correlations each time. Let the algorithm do it. If you’re trying to understand what’s driving revenue, well, you know what, I can go out and actually do that. I can build that dashboard and start to understand those variables. One last point I want to leave you with as you think about what’s going on in this world, right? You as Tableau analysts are on the front lines of what’s going to happen over the next two to three years. This concept of data scientists sitting in a void, where nobody knows where they sit?
We had the old programming adage: take the pizza box and just slide it under the door, right? A little bit of that is kind of going around the data science realm. It’s not going to be that way; it really isn’t. You are the front line. You’re the folks that are going to be figuring out: what are the problems, right? What are the variables that I want this data scientist to look at? Because maybe you don’t necessarily understand what XGBoost does and how it works, but I can give you the tools to understand what’s actually driving the problem, right? And then you’re going to become part of that collaboration. We call it vertical collaboration within Dataiku, where the person who builds out your dataset may be in data engineering, getting your extracts and making them ready for your Tableau. But what’s going to happen next, right?
You’re going to be building those dashboards. You’re trying to analyze what’s happening. You’re going to give that to the data scientists, because maybe the data scientist has no idea what your business is. You do. You spend all day, every day working in it. So you are part of that process of data engineering, data analysis, and then ultimately data science, and we’re trying to make that available to you. Just like what InterWorks did when they built out their dashboard and made it available to the Tableau community. That’s going to continue and carry on over the next few years. Thank you so much. Thank you for spending time with me today. So take care and have a great rest of the conference.