Introduction
We all know that feeling when you have to learn something completely new. The topic overwhelms you and the problem seems like rocket science. You have no idea where to start or whom to ask for help. You have to do something but you don’t know what, so you spend your time on Facebook or start cleaning your house. I bet that university students know best what I’m talking about.
Learning data science from scratch is pretty similar. I can tell you, because I have been in this situation. I dived into data science a few years ago, having no idea what exactly it was for or how to learn it. What I knew was that data science is about solving problems and that it was going to be ‘the sexiest job of the 21st century’. I was young and naive, so I went for it. Looking back, I must admit that I don’t regret it. Anyway, this text is not about why you should become a data scientist but about how to become one. If you were told similar things and you wonder how the hell to start doing this ‘sexy data science stuff’ and how to find your first job, I’ll try to help you a bit.
What do you really need to learn before your first job interview?
I assume that you already know what data science is and you understand the basics. The next step would probably be finding a data science job. The question is, are you good enough to get one? In my opinion, the best way to find out is to go to some interviews. My application was rejected many times, but this only made me stronger. Naturally, not everyone has enough time or patience to go to dozens of meetings. So here it is: what do you have to learn to get your first data science job? Obviously, each company needs a different skill set. But after many job interviews, I’m able to point out some must-know areas for every data scientist. Let’s get to the point.
Must-haves for every data scientist
1. Basic knowledge
‘Basic knowledge’ includes all the core concepts of data science and statistics, such as probability distributions, distribution functions, classifiers, supervised and unsupervised learning and many more. You need to know the theory very well. If somebody wakes you up in the middle of the night, you have to be able to list the basic classifier types or explain the concept of Bayesian analysis. I’ll try to list some of them, but there are just too many to mention them all. The good thing is that while learning the following ones, you’ll get to know many others:
Probability distributions and their basic types (Normal, Log-normal, Poisson, Discrete, Bernoulli),
Variable types (Continuous, Categorical, etc.),
Statistical tests and their use (T-test, Chi-square, ANOVA, etc.),
Supervised & unsupervised learning,
Bias-Variance Tradeoff (under- and over-fitting),
Types of machine learning tasks (Big 3 -> Regression, Classification, Clustering),
Entropy,
Bayesian analysis.
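To make one of these concepts concrete, here is a minimal sketch of Shannon entropy for a Bernoulli variable (a coin flip), written in plain Python. The function names are my own choice for illustration and do not come from any library. It shows that a fair coin is the most uncertain case (1 bit), while a biased coin carries less uncertainty.

```python
import math


def shannon_entropy(probabilities):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def bernoulli_entropy(p):
    """Entropy of a coin that lands heads with probability p."""
    return shannon_entropy([p, 1.0 - p])


fair_coin = bernoulli_entropy(0.5)    # maximum uncertainty for two outcomes
biased_coin = bernoulli_entropy(0.9)  # more predictable, so lower entropy
certain = bernoulli_entropy(1.0)      # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The same function is what decision trees rely on when they measure how much a split reduces uncertainty, so it is worth being able to write it from memory.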
2. Algorithms
It’s almost certain that you will get a question like “What machine learning algorithms do you know and could you explain them?”. Don’t worry, you don’t have to know all the algorithms that exist. It’s not even possible as there are hundreds of them. Concentrate on the basic ones, like:
Naive Bayes,
Regression (Linear & Logistic),
Decision Trees,
Neural Networks.
It’s also good to know some less common ones. Find 2–3 recently popular ones that seem attractive to you and learn how to explain them in a few sentences. These extra ones could be, for example:
Regularized Regression (Ridge, Lasso, Elastic Net),
Support Vector Machine,
k-Nearest Neighbors,
Ensemble methods (Random Forests, Gradient Boosting, etc.).
Where can you learn them? You need to figure it out by yourself, but there are plenty of articles on the Internet, so it won’t be that hard. Remember that you’ll have to explain these algorithms during a job interview, so you need to understand them really well.
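One good way to check whether you really understand an algorithm is to implement it from scratch. As an illustration, here is a minimal sketch of k-Nearest Neighbors in plain Python. The toy data and the function name are made up for this example, and a real project would more likely use an existing library implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 5.5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_first_group = knn_predict(points, labels, (1.2, 1.4))
near_second_group = knn_predict(points, labels, (6.4, 6.2))
print(near_first_group, near_second_group)
```

If you can explain each line of a sketch like this, you can probably explain the algorithm in an interview too.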
3. Model evaluation
This is another area where you need to feel comfortable. You can be sure that you will be asked a question more or less like ‘How would you evaluate your model?’. It’ll be the introduction to more specific questions, where you’ll have to show your knowledge of the following concepts:
ROC curves,
Lift curves,
AUC (Area Under Curve),
Confusion matrix with all its derived measures, like TP, TPR, FPR and more,
Kolmogorov-Smirnov (KS) Goodness-of-Fit Test,
Gini measure.
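Several of these measures are closely related, which is easier to see in code. The sketch below, in plain Python with made-up labels and scores, computes the confusion matrix counts, TPR and FPR, the AUC (as the share of positive/negative pairs ranked correctly), the Gini measure (assuming the common definition Gini = 2 * AUC - 1) and the KS statistic (assuming the scoring-model usage: the largest gap between TPR and FPR over all thresholds). All function names are my own.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """True positive rate and false positive rate."""
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """Largest gap between TPR and FPR over all candidate thresholds."""
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr, fpr = rates(*confusion_counts(y_true, scores, threshold))
        best = max(best, abs(tpr - fpr))
    return best


# Made-up example: 1 = positive class, scores are model outputs.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

counts = confusion_counts(y_true, scores)
tpr, fpr = rates(*counts)
auc_value = auc(y_true, scores)
gini = 2 * auc_value - 1
ks = ks_statistic(y_true, scores)
print(counts, tpr, fpr, auc_value, gini, ks)
```

Plotting the (FPR, TPR) pairs for every threshold gives the ROC curve, and the area under that curve is the AUC computed here.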
4. Feature selection
Generally when you build a model, feature selection is one of the first steps. During an interview, feature selection may appear later, after questions about algorithms and model evaluation. We generally use two types of methods:
Filter methods, where features are selected on the basis of their correlation with the outcome variable,
Wrapper methods, where you train a model using a subset of features and then choose the best subset.
Filter methods
Filter methods are independent of any machine learning algorithm. Features are selected on the basis of their correlation with the dependent variable. We simply check how valuable they are. You can use different methods for continuous and categorical variables, which means that you have to be familiar with:
Pearson’s correlation,
LDA,
ANOVA,
Cramér’s V,
Chi-square.
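As an example of a filter method for continuous variables, here is a minimal sketch that ranks features by the absolute value of Pearson’s correlation with the target and keeps the top two. The feature names and numbers are invented for illustration, and the cut-off of two features is an arbitrary assumption.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort feature names by |correlation with the target|, strongest first."""
    strengths = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(strengths.items(), key=lambda item: item[1], reverse=True)


# Invented data: feature_a follows the target, feature_b moves against it,
# feature_c is close to noise.
target = [1, 2, 3, 4, 5, 6]
features = {
    "feature_a": [2.0, 4.1, 5.9, 8.2, 9.9, 12.0],
    "feature_b": [6, 5, 5, 3, 2, 2],
    "feature_c": [5, 1, 4, 2, 6, 3],
}

ranking = rank_features(features, target)
selected = [name for name, _ in ranking[:2]]
print(ranking)
print(selected)
```

Note that the absolute value matters: a strongly negative correlation (feature_b) is just as informative as a strongly positive one.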
Wrapper methods
Common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.
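Forward feature selection can be sketched in a few lines: start with no features, repeatedly add the one that improves the model’s score the most, and stop when nothing helps. The sketch below is plain Python under simplifying assumptions: the model is a tiny nearest-centroid classifier I wrote for this example, the score is accuracy on the training data (a real project should use cross-validation), and the data is invented.

```python
import math


def centroid_accuracy(rows, labels, feature_idx):
    """Training accuracy of a nearest-centroid classifier using only the given columns."""
    def project(row):
        return [row[i] for i in feature_idx]

    # One centroid (mean point) per class, computed on the selected columns.
    centroids = {}
    for cls in sorted(set(labels)):
        members = [project(r) for r, l in zip(rows, labels) if l == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]

    correct = 0
    for row, label in zip(rows, labels):
        point = project(row)
        predicted = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        if predicted == label:
            correct += 1
    return correct / len(rows)


def forward_selection(rows, labels, n_features, score=centroid_accuracy):
    """Greedily add the feature that improves the score most; stop when none does."""
    selected = []
    best_score = 0.0
    remaining = list(range(n_features))
    while remaining:
        candidates = [(score(rows, labels, selected + [f]), f) for f in remaining]
        # Highest score wins; on equal scores, prefer the lower column index.
        top_score, top_feature = max(candidates, key=lambda pair: (pair[0], -pair[1]))
        if top_score <= best_score:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Invented data: column 0 separates the classes, column 1 is noise,
# column 2 is only partly informative.
rows = [
    (1.0, 5.0, 0.2), (1.2, 1.0, 0.9), (0.8, 3.0, 0.1), (1.1, 4.0, 0.3),
    (3.0, 2.0, 0.8), (3.2, 5.0, 0.2), (2.9, 1.0, 0.7), (3.1, 3.5, 0.9),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

selected, best_score = forward_selection(rows, labels, n_features=3)
print(selected, best_score)
```

Backward elimination works the other way round: start with all features and repeatedly drop the one whose removal hurts the score the least.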
Summary
So these are the basics. I have to disappoint you: this is just the beginning. Data science is a very wide and complex area. But still, this knowledge increases your chances of getting your first data science job. What if more than half of these terms are unfamiliar to you? It means that you need to expand your knowledge, which you can basically do in two ways: enrolling at a university and/or learning on your own.
University degree or learning on your own?
I’ve heard that question many times and each time my answer was different. That doesn’t mean I kept changing my mind. It means that the answer depends on the person who asks. I should emphasize that I graduated from a Big Data program quite recently, so I am aware of what institutional education offers in terms of data science, and my knowledge of it is fairly up to date. So, which path suits whom?
University
Data science studies can be very useful for most young people with limited experience. When you are young, you generally have a lot of doubts, you often change your mind and you don’t have enough patience. At least I was like this. The good thing about studies is that they are based on a syllabus. Syllabuses can be better or worse, but generally they are well thought out and fairly comprehensive. This ensures that you will learn an essential set of skills step by step. Improvement requires regularity. Taking up studies gives you broad knowledge, which you can then deepen on your own. What is more, you are going to study with other curious people, with whom you can exchange your first experiences. It can be very motivating, especially when you still don’t know what this whole ‘data science’ is exactly about.
Online courses
The problem with university studies is that they take a long time and are often rather impractical and outdated. I expect that you want to become a data scientist in, let’s say, less than half a year, and definitely not after two years. The bad news is that it’s not really possible. Unless… you already have strong experience in statistics, programming or mathematics. You may also be a very gifted, fast-learning person, but it’ll still take you more like 1–2 years to become a real data scientist. Anyway, for these gifted, experienced and/or hard-working people it’s possible to learn data science on their own. There are plenty of machine learning courses available. The most popular platforms are probably Coursera and Udacity.
Remember that you have to really understand the terms mentioned above. Learning by heart won’t work. Read about these concepts, find a practical example and try to implement them. The best way of learning data science is by doing it.
Good luck!