The Data Daily

Data Science 101: Steps to Becoming a Successful Data Scientist w/ Randy Lao #DataTalk - Experian Global News Blog

Every week, we talk about important data and analytics topics with data science leaders from around the world on Facebook Live. You can subscribe to the DataTalk podcast on iTunes, Google Play, Stitcher, SoundCloud and Spotify.

This data science video and podcast series is part of Experian’s effort to help people understand how data-powered decisions can help organizations develop innovative solutions and drive more business.

To keep up with upcoming events, join our Data Science Community on Facebook or check out the archive of recent data science videos. To suggest future data science topics or guests, please contact Mike Delgado.

In our upcoming #DataTalk, we’re talking with Randy Lao about the steps it takes to become a data scientist — and the importance of giving back to the data science community.

Mike Delgado: Hello, and welcome to Experian’s weekly #DataTalk, a show where we talk to data science leaders from around the world. I’m super excited to be chatting with Randy Lao. He teaches machine learning at the Data Application Lab and data science boot camps at USC School of Engineering. He is prolific on LinkedIn, he is always helping the LinkedIn community learn more about data science, and it’s an honor to have Randy in today’s chat. Randy, how’s it going?

Randy Lao: It’s going great, and it’s my honor to be here. I appreciate the time for you to have me participate in this. Thank you so much.

Mike Delgado: It’s great having you. For those who aren’t following Randy, I’m going to have a link on our Experian blog so you can connect with him on LinkedIn, follow him there, because he is sharing just so much knowledge about the things that you need to know if you want to work in data science.

What I love about Randy, too, is that he’s so encouraging. He supports the community, he’s always giving back, and for those who are not following him on social channels, make sure you’re following Randy. I’ll give you the short URL right now. The Experian blog is just ex.pn/datatalk48, and that will bring you over to the Experian blog, where we have links to Randy’s social profiles. I highly recommend you follow him there.

Today we’re talking with Randy about one of the biggest questions we get in our Facebook group, which is, “How do I become a data scientist?” And Randy, when we were emailing back and forth, said, “Let’s talk about the steps it takes to become a successful data scientist.” To kick this off, Randy, can you share with our community your journey that led you into data science?

Randy Lao: My journey’s been pretty sudden. Everything that’s been happening with LinkedIn, with learning data science, all happened last year. And this is all due to my Springboard experience in Data Science Boot Camp. A big thanks to Springboard. It’s a great boot camp, and what I’ve learned from that is the whole process of data science in general, not just the learning aspect, but more the self-learning, and that it is never-ending.

There’s so much to learn, and what I also learned is the value of networking. Connecting with people, building that relationship, and just understanding who they are and what problems they’re going through allows you to be more capable working with different problems. I also got a chance to work with USC. So in regards to an e-learning platform, I’ve been exposed to three different boot camps: one at USC with Trilogy Education, one with Springboard and then one for Data Application Lab. I do have a breadth of knowledge in regards to what people should be studying for and especially some advice for job searches.

Mike Delgado: That’s awesome. Tell me about your move into machine learning. You teach machine learning at these boot camps. What drove you to focus specifically on ML?

Randy Lao: I actually focus on both data science and machine learning, but what I like about machine learning in general, and what I emphasize everyone else to focus on more, is that it’s gonna be the future. Machine learning, AI, deep learning. There was a quote from Mark Cuban saying, “At the moment, if you don’t know what machine learning is, in three years, you’re gonna be a dinosaur.” So, if that doesn’t motivate you —

Randy Lao: Yeah. I heard that. But my main motive was that it’s pretty cool. When I break down data science, I think of someone having two skills. Using machine learning as a way of predicting things. Like you’re a wizard. What I tend to write a lot on LinkedIn is I try to simplify these very vague terminologies.

A simple breakdown is machine learning is just data plus an algorithm. At the simplest case, that’s about it, and you’re just using these different algorithms to come up with different predictions. Each prediction can be accurate and be evaluated for different use cases, such as supervised learning, unsupervised learning. In regards to that, I find that having the ability to predict things, or become more prepared for the future, is something interesting.
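Editor's sketch: the "data plus an algorithm" idea can be made concrete with a few lines of plain Python. The data below (hours studied and quiz scores) is invented for illustration, and a least-squares line fit stands in for "an algorithm"; it is one simple supervised example, not a method Randy describes in the interview.

```python
# "Data plus an algorithm": a least-squares line fit in plain Python.
# All numbers are made up for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# The data: inputs paired with known answers (labels), which makes this supervised learning.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

# The algorithm turns the data into a model, and the model makes a prediction.
model = fit_line(hours, scores)
prediction = predict(model, 6)
print(prediction)
```

Swapping in different data or a different algorithm changes the predictions, which is the sense in which the two ingredients together make up the model.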

Mike Delgado: No doubt. Tell me about some favorite data science projects.

Randy Lao: The two that I would like to talk more about are one with Natural Language Processing. That made me appreciate and understand that text data is super dirty. And it takes a lot of time to preprocess, to clean up your data. And by a long time, I mean a long time. Especially with the data I was working with. It was from a Kaggle competition, provided by Mercari, so it had about, I don’t know, a few million rows.
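Editor's sketch: to give a sense of what "cleaning up" text data involves, here is a minimal example using only Python's standard `re` module. The sample listings and the specific cleaning rules (lowercasing, dropping tag-like markup, removing punctuation) are assumptions chosen for illustration; they are not the steps used in the project described above.

```python
import re


def clean_text(raw):
    """Lowercase, strip HTML-like tags and punctuation, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop anything that looks like a tag
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation and symbols with spaces
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()


# Hypothetical product listings, invented for this example.
listings = [
    "  BRAND NEW!!! <b>Leather</b> wallet -- never used  ",
    "Kids' T-Shirt, size 4T (blue)",
]

cleaned = [clean_text(item) for item in listings]
for item in cleaned:
    print(item)
```

Real text data usually needs many more rules than this (misspellings, emoji, mixed languages), which is part of why preprocessing takes so long.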

Randy Lao: That made me also appreciate the fact that learning Spark and Hadoop, and these big, the [inaudible 00:05:43] systems, allows you to make your work and life a lot easier. On that project, I learned a lot about NLP. Long story short, data is dirty, and data cleaning is important. Another project that I liked was I took some time and did some in-depth analysis on employee turnover. IBM provided a data set that allowed you to explore employee data.

And what that made me realize is hiring employees, keeping an employee, having that balance of a good working environment is very important. It made me appreciate the things that I see throughout work if I do see people leaving a company. It made me value what goes on in a data set or what goes on in their life. That’s just something that needs to be evaluated. Those are just two of the projects that I wanted to share.

Mike Delgado: That’s awesome, and I love how you talked about the dirty data and the data wrangling you’ve had to do with these big data sets. What I also think is great, Randy, is your curiosity and you’re driven, you’re joining these Kaggle competitions, which I think is important for anybody who wants to be working in data science, because it helps you to level up, to learn new things and also work with a group of people who can help you along the way. I think that’s awesome that you’re doing that.

Mike Delgado: Can you tell us about the boot camps that you’re teaching?

Randy Lao: At the moment, I’m teaching two boot camps. One with USC, and that’s more of a data visualization, data analytics boot camp, and then one with Data Application Lab, and that’s more on the machine learning side. My day-to-day job environment is we have a lot of students. So my goal is to track them, monitor their results, and teach them and support them throughout their whole process. I think the most common struggle a lot of students have, whenever they’re tapping into the field, is the concept of programming.

I get a lot of questions about that, and what I tell a lot of people, too, is … When they’re learning these programming languages like R or Python, and they’re interested in data science, some people make the mistake of learning that language just for data science. They’re just learning the packages. “Tell me what packages do I need, pandas, Matplotlib? I could learn.” But I think they’re missing a very important aspect of it, the actual programming language itself.

Some people have a lot of problems debugging, understanding why their code’s not working. And this all goes back down to the fundamentals of programming. That’s what I have to emphasize, especially for the viewers. If you’re ever interested in data science, spend some time learning programming itself, as a language. Understand the different types of data structures, some algorithms and the basics. Conditionals, for loops and statements, because learning the fundamentals is gonna take you a long way, and it’s like a building block to your success.
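Editor's sketch: the fundamentals mentioned here (a data structure, a loop, and conditionals) fit in a few lines of Python with no data science packages at all. The function name, grade cutoffs and scores are hypothetical, chosen only to illustrate the building blocks.

```python
def count_by_grade(scores):
    """Group scores into letter grades using a dict, a for loop, and conditionals."""
    counts = {"A": 0, "B": 0, "C": 0}   # data structure: a dictionary of counters
    for score in scores:                # loop: visit every score once
        if score >= 90:                 # conditionals: decide which bucket applies
            counts["A"] += 1
        elif score >= 80:
            counts["B"] += 1
        else:
            counts["C"] += 1
    return counts


grade_counts = count_by_grade([95, 82, 67, 91, 78])
print(grade_counts)
```

Being comfortable reading and debugging code like this is what makes package-level code easier to reason about later.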

Mike Delgado: I think that’s solid advice. For those who are just getting started, are there certain programs or coding languages you would recommend they check out or start to play with?

Randy Lao: Yeah. Right now, data science is still a growing field. I’m not too sure what’s gonna be the future. But at the moment, if you’re interested in gaining a good, valuable asset to any company, the two most common programming languages are Python and R. Also, some familiarity with SAS, possibly, depending on what company you’re working for. But I’d recommend R if you’re more of a statistical analysis type of person, because it was made by statisticians.

I would recommend Python as just a general programming language and if you’re more emphasized on production and machine learning deployment, because Python is very versatile. You can use it just for programming, you can use it for efficiency, and the community there is huge. A lot of libraries are popping up every year, and it’s good support for you to get help.

Mike Delgado: How long would it take the average person to pick up a language like Python or R?

Randy Lao: That depends on the person.

Randy Lao: Depending on their background, but if someone’s new to programming, like they don’t know anything, I would say take a course on maybe DataCamp or maybe go to Codecademy. If you spend maybe three hours a week or three hours a day for a week, it should give you a good sense and a good foundation on what you’re going to be dealing with.

I would recommend Codecademy. I would recommend DataCamp. Those are my main two for programming in general. It’s best to learn this stuff, but you actually learn the most when you’re programming. And that’s what Codecademy provides. It’s a platform where, to proceed onto the next level, you have to code out your homework.

Mike Delgado: Back when I was in college, a long time ago, I took a Visual Basics C+ class, and our tests were handwritten, and it just freaked me out. I couldn’t test anything.

Randy Lao: Yeah. There’s a big difference between typing out the code and writing out the code on paper.

Mike Delgado: Yeah. It was brutal. I barely passed. So what do you say for people like me who struggled or are currently struggling learning a language, and they maybe feel like giving up, but they have a heart for getting into data science, but it’s just like this first big step of learning a language. What advice would you give to them?

Randy Lao: I think this advice can be applicable to anything you learn. With programming, or with learning a new concept, the initial stage is always gonna be the hardest. And this is where a lot of people tend to quit. It can get a little overwhelming. What I would like to say, my biggest advice, is to always keep the big picture in mind. A problem that I see, when I’m mentoring some of my students, is they freak out. They panic on some of the small, nitty-gritty syntax of a programming language, which causes them to not handle code properly.

But I always tell them that this is a normal thing that happens in the coding life. Your main focus should be on the problem you’re trying to solve. Focus on that more rather than just the syntax, because you can Google the answers or check Stack Overflow. That’s literally what all my friends who are software engineers do. They say that 80 percent of their time, they’re on Stack Overflow or they’re Googling.

Especially as a newbie, you’re gonna be overwhelmed with all these syntax errors. My best advice is to calm down, just Google it, and it’s just a normal process. The more practice you get, the more errors you’re gonna go into, and then the more realization on how you’re gonna solve it.

Mike Delgado: I love that. That’s great advice. Data science is a huge field. You know, those headlines about data science being the sexiest position out there. Can you describe the different roles that fall within data science for people to get a sense of what people are doing in different aspects of the jobs?

Randy Lao: Whenever I teach these people about these concepts, I like to break it down into analogies or something very easy to understand. In this case, I would describe data science as you’re just solving a problem using data. Keep it simple. And there’s a lot of process involved, whether that’s data cleaning, whether that’s getting your data, data gathering.

Whether that’s data pre-processing, whether that’s data modeling. Anything you’re doing with data is some aspect of data science. It can be broken down into three main parts. Depending on your interests, you can branch out into more of an analyst. That’s one. You can branch out more as a data engineer, creating the actual pipelines on how the data’s gonna come in and out of your company, or you could be more focused on the machine learning engineer aspect, creating the models, putting things into production.

Three main focuses, but all share a common skill set, which I would break down like this. You should have some understanding of the math and stats, understanding distributions, understanding when to use p-values, understanding correlations. A good foundation on databases, so SQL and NoSQL concepts. Good understanding of some basic math, linear algebra, calculus and a good foundation on programming, whether that’s R, Python or SAS. And again, this is something that I don’t think schools teach, and this is what you learn throughout life. Especially in data science or any job, it’s about people. People skills, communication, having empathy, understanding what problems they’re having, being in their shoes so you can have a better understanding on how to fix their problems.

Mike Delgado: I love that you’re touching on the human element. So often we get caught up with the math, the stats, the computer science portions. What would you say are key personality traits or things that you think make a successful data scientist?

Randy Lao: Data science is a big field. Very broad. But the key to success in this field, since it’s fairly new, and it’s changing a lot, you’ve gotta have that passion of learning, ’cause you’re gonna be learning a lot throughout the day. It’s never-ending learning. And another thing is having a drive.

You’ve gotta have that motivation, you’ve gotta have that drive on finding your purpose in life, finding the value you wanna bring to the community, whether that’s maybe on the education platform, helping people out. Whether that’s on marketing, whether that’s building a product for your community. That’s my two key takeaways on waking up every day and having a reason to go on through your day.

Mike Delgado: No doubt. The things I see you doing on LinkedIn, you’re constantly learning, you’re constantly sharing what you’re learning, and I love your approach and how you break it down to the simplest analogies to help explain things.

Randy Lao: Oh, thank you.

Mike Delgado: I think that’s extremely helpful. Especially for anybody like myself, who’s just beginning to learn more about data science. I also love that you touched on the passion and the drive, because, like you said, working on a data project for the very first time or just getting started can seem overwhelming. I definitely know what that feels like.

Mike Delgado: I only took that one class, and I was like, “I’m out of here.” Tell me about the level, for someone who’s interested in taking a class or going to a boot camp, somebody who has no background at all, they haven’t taken a college math class in a long time. Where should they start?

Randy Lao: This is a question I asked myself a few months back, and my advice is, for data science, you should have some … Nothing too much, depending on your role. If you’re a heavy machine learning enthusiast, you might need a stronger grasp on these concepts. But for a general majority of data science as a whole, some understanding of linear algebra, and I would say read up until you understand eigenvalues and eigenvectors. That should be plenty. Which is maybe 10 chapters in.

Then I would say focus on some calculus concepts, up to maybe integrals, because most of the calculus used in machine learning is all about optimization problems. I would also say have a heavy emphasis on stats and probability. Because, if you think about it, the data you’re working with is just a sample of the outer population of what you’re trying to gather data out of. Your goal as a data scientist is to get a sample that best represents the population you’re trying to analyze.
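Editor's sketch: the point about a sample standing in for a population can be seen with Python's standard `random` and `statistics` modules. The "population" here is simulated (100,000 values drawn from a normal distribution with made-up parameters), which is an assumption for illustration; in practice the full population is usually not available, which is the whole reason for sampling.

```python
import random
import statistics

# A seeded generator so the example is repeatable.
rng = random.Random(42)

# Simulated "population": 100,000 heights in cm (invented parameters).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# A simple random sample of 1,000 members, drawn without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

A random sample of this size tends to land close to the population mean. A sample collected in a biased way (say, only the tallest people) would not, no matter how large it was.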

Having some basic foundations on stats, probability, linear algebra, calculus will get you far. I would say Khan Academy would be my number one. It’s my favorite, on YouTube. That’s about it for the heavy math.
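Editor's sketch: as one illustration of why calculus shows up in machine learning as optimization, here is gradient descent on a one-variable function in plain Python. The function f(x) = (x - 3)^2, the learning rate and the step count are all assumptions picked for the example; real models apply the same idea to many parameters at once.

```python
def minimize(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to move toward a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
best_x = minimize(lambda x: 2 * (x - 3), start=0.0)
print(best_x)
```

The derivative says which direction is downhill, and the loop keeps walking that way in small steps.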

Randy Lao: Khan Academy is organized, and it’s free.

Mike Delgado: Can’t go wrong with that.

Mike Delgado: So that would be your recommendation for the math, stats side. What about the programming side?

Randy Lao: I would highly emphasize either DataCamp or Codecademy. Back to what I said before. Coding is literally the backbone of what you’re gonna be doing as a data scientist, because all of the things that you’re gonna be handling, all the data, is gonna be in your computer. The way to manipulate it and have a hands-on approach with working with data is programming. You have to learn the basics. A lot of people say you should learn R or Python.

For a newbie, I say Python. Just pick one. Don’t switch around. Pick one, learn it as a programming language itself. That’s one and two. Now you’re ready; now you can start dabbling with the packages available. I’ll recommend learning pandas for data manipulation, NumPy for any transformations of your data, which you might need for machine learning. And then, if you’re into machine learning, you can move into deep learning. Either TensorFlow or Keras.

Mike Delgado: I just read this article. I saw it on LinkedIn. It was an article from AARP. It features this 82-year-old woman in Japan who took up coding. She was already retired; she just wants to keep learning.

Mike Delgado: Yeah, 83 years old, she ends up learning to code and created a mobile app for seniors, like a gaming app for seniors.

Randy Lao: If you could send me that link, I would love to read it.

Mike Delgado: I’ll find it and send it to you. That was so encouraging and motivational to me. ’Cause here she is, she could be spending the rest of her time relaxing, and she’s like, “I wanna keep learning” [crosstalk 00:22:46], at 83. “I’m gonna learn how to code.”

Randy Lao: If that doesn’t motivate you guys to code, then I don’t know what.

Mike Delgado: It put me to shame.

Mike Delgado: It was so cool. So another big question that we get a lot in our Facebook community is people are wondering, “After I go through the boot camps, I get certified as a data scientist, how do I get my first job? How do I get my foot in the door?”

Randy Lao: This is what I emphasize. Back to either blogging, LinkedIn or just having a social presence. After you’ve done all of this, after you’ve done the boot camps, after you’ve learned the foundations, this is where you have to start creating things. Creating artifacts. Things that have lasting value. ’Cause when you create things and you post it, whether it’s a blog, a post, [inaudible 00:23:51], that’s lasting value that people can see that you’ve done.

That’s what they need for evidence. Because you can have all the best skills in the world, but it’s all in your head. But if you don’t take the time to share with the public, share with the community, provide the value back, that’s something that’s gonna hinder you, and you’re gonna lose a lot of opportunities that people are in need of. This is another thing, too. It’s not very selfish. You’re providing use to the community as well. You’re providing your ideas, you’re providing your analysis, you’re providing your creativity that other people might need to get inspired by.

This is where I like to emphasize LinkedIn as a great platform to not only share your work, but you’re also learning from others as well, with the same common interests. In my case, machine learning and data science. That’s something I would say about that.

Mike Delgado: What’s good about that, Randy, is that no matter if you’re just starting out, I mean that would be a prime time. You may not have much to say, but you have a lot of questions, and you just want to start to build relationships. Even if you’re just starting out, you can start blogging about your route to becoming a data scientist. That can be a huge encouragement to others who follow your blog, and then for you, it’s like your own personal diary.

Randy Lao: Yeah. And I’d like to add onto that as well. Believe it or not, you have to ask a lot of questions. You’ll be surprised how many people are out there too who are willing to help you out. Also, especially when you’re new, my advice is to find a mentor. Whether that’s from a book, whether that’s from this podcast, whether that’s from someone’s LinkedIn profile, you have to find someone who’s been there and done that and then try to copy them.

Because in life, if you don’t have a mentor, you’ll be up and down, failing and succeeding, failing and succeeding. A mentor gives you a little fast track. Instead of ups and downs, it’s like a linear path. It saves you time. They’re gonna be there to help you out and tell you what to do and what not to do. I’m learning from you right now, as a mentor. I’m learning from everybody on LinkedIn. I’m learning from people on YouTube. Think of every connection and every person you’re meeting as a learning opportunity.

Mike Delgado: If you were reaching out to somebody to be a mentor, someone you had talked to once a month, how would you reach out to that person? ’Cause I feel like sometimes that person’s gonna be giving up their time, right?

Mike Delgado: To mentor you. And you wanna find, also, someone who’s gonna be the right fit. What advice do you have for finding a mentor?

Randy Lao: I receive a lot of messages on LinkedIn, and I’ve talked to a lot of other data scientists who’ve been receiving messages as well. A common mistake, or I wouldn’t say a mistake, but something that won’t give you a good success rate in regards to a response back is just asking them straight off “I need a job” or “Can you help me out?” without providing some initial value back to them. My advice would be you have to do your work first. You have to find out what you need and you want, and you have to provide some sense of, “I’ve tried.”

Start contributing back, and you’ve got to be very consistent. Just because someone hasn’t responded back, it might be a busy day. Keep that in mind, have that reflection of that person, and just be consistent. You have to provide value first, and also look them up as a person. Don’t just ask that person for your sake. The transaction of a relationship is never win-lose, lose-win. It’s always gonna be a win-win. What can you give them, and what can they give you?

Mike Delgado: That’s good. I think that’s awesome advice, and I think the trouble sometimes is somebody who’s new, they don’t know what they can give a mentor.

Mike Delgado: You know what I mean? So how would you, if you were just starting out and you’re looking for a mentor … Obviously, you’re always giving back. That’s just your nature. How would you approach the mentor? You’re trying to think, “How do I add value? How do I give back to this person?”

Randy Lao: This is where I started. It all goes back to Meetup and LinkedIn. I was completely new, but what I saw is that, because you’re new to the data community, no one knows who you are. You have to start being consistent in regards to posting your interests and questions and curiosity about the field, whether that’s data science or machine learning. That, at the very least, shows that you’re trying and that you’re motivated to learn.

Having that mindset, people are gonna notice that. When you start tagging people on LinkedIn with quality questions, you’re gonna get answers easily. It all goes back to you, on how you tailor your response and how you tailor your questions. If it’s of quality, of course we’re gonna be answering it. But if it’s very generic and doesn’t seem like you spent a lot of time writing it, then of course the feedback won’t be as great.

Mike Delgado: Do you have any recommendations for those who are just getting involved in data science? ’Cause you mentioned posting questions. I’m curious about any favorite online communities you’d recommend people check out.

Randy Lao: Yeah, definitely. I’ve done a lot of research about the different data sites out there, and my few favorites that I look up at least once a week would be Towards Data Science on Medium. Very good articles made by a lot of good professionals, and you get a lot of creativity and inspiration from these posts. Another one would be Analytics Vidhya, which is an Indian online community for data science. Very popular. A lot of good resources there. Another one of my favorites would be DataCamp. Good for learning data science as a whole. Another one would be Codecademy — again, for learning programming. For more of the math side, I would say Khan Academy. Those would be my five recommendations.

Mike Delgado: Those are great. I love that you included some blogs, because some people don’t think of blogs as being a social network but get involved in the comments.

Randy Lao: Another thing that I think might make the most impact is Kaggle. Especially when you’re new to the field, you have to understand what people are doing, and Kaggle is an online data science platform where people are allowed to upload their notebooks, which is literally the code that they’re writing themselves. It’s code from the experts, the machine learning experts, the experts in analysis. Use that time to read through the code and understand what they’re doing so you can get a sense of how they’re thinking. That’s another thing, too. It’s good to know these tools, but there’s a difference in understanding when to use them and why to use them, which is what makes you a data scientist. A way of thinking and tackling different problems.

Mike Delgado: Well, Randy, I am so bummed ’cause this half-hour is up and you have been a phenomenal guest, providing so many wonderful insights. Again, props to you for all the work that you’re doing to encourage, motivate our data scientists. I’m walking away from this #DataTalk just excited, and —

Mike Delgado: Before we go, can you let everyone know how they can get in touch with you?

Randy Lao: First, before I say that, I just wanna thank Mike and everybody who’s watching. Thanks so much for allowing me to be here. I’m glad to share some of my knowledge with you guys. Back to reaching out, you can connect with me through LinkedIn. I’m very active on that. My account is Randy Lao Sat. Message me anytime, leave a note, and also connect with Mike. Leave a note to him, because it’s all thanks to him for making this happen.

Mike Delgado: Thanks, Randy. I’m gonna put up on the screen here quickly, for those who are watching live, the … OK. There’s our URL. This is gonna point to the Experian blog, where I’ll have links over to Randy’s work. Randy, after this broadcast, can you send me any other links you want me to post, like Kaggle or Meetup? I’ll make sure to add them there. That way they can follow you across these different networks. But again, for those listening to the podcast, it’s just ex.pn/datatalk48, and that’s where you can get connected with Randy. And again, follow him, and, like Randy was saying, engage in the conversation. If Randy posts a question or he posts something, the best way to get to know Randy is just to start commenting on his posts and add value. Add value to his network by doing that. Randy, thank you again for your time. We’d love to have you back, anytime you wanna come back.

Randy Lao: Of course. Any time.

Mike Delgado: This has been a great discussion. For those watching for the first time, we do data talks every single week. The day and time varies, based on when our guest is available, but we cover all things machine learning, artificial intelligence. In fact, we had one earlier dealing with how scientists at MIT are using machine learning to find planets, which was a fascinating discussion.

Mike Delgado: ’Cause machine learning is where things are headed. So take Randy’s advice. Get in now, if you’re interested. Randy, thank you so much for being our guest, and we’ll chat soon.

Randy Lao: Definitely. All right. See you later, everybody.

Randy Lao is a Machine Learning Teaching Assistant at the Data Application Lab, Data Analytics Teaching Assistant at USC’s Viterbi School of Engineering and Market Research Analyst at IDEAS (formerly Data Science Association). Check out his work on Kaggle.
