Data Science is one of the most exciting fields of the 21st century. The ability to understand human speech, recognize faces, interpret text, and much more has vast applications across industries. In this article, we define some of the most important terms in more detail.
Data Science is a multidisciplinary field, sitting at the intersection of Mathematics, Statistics, Computer Science, and domain-specific knowledge.
It is the application of Mathematics and Statistics to real-world problems to solve them faster using computers. It involves Software Development (the software which solves the problem), Machine Learning (to train the machine using mathematics), and Traditional Research (to make mathematical assumptions about the problem).
Data Science is applied to problems that traditional algorithms would take too long to solve, or would not solve accurately enough; for example, predicting an area’s housing prices, controlling an autonomous vehicle’s movements, or recognizing human faces.
A dataset is a particular instance of data used for analysis or model building at any given time. Datasets come in different flavors: numerical data, categorical data, text data, image data, voice data, and video data. For beginner data science projects, the most popular type of dataset contains numerical data and is typically stored in the comma-separated values (CSV) file format.
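As a quick illustration, here is a minimal sketch of loading a CSV dataset with pandas and taking a first look at it; the file name `housing.csv` is just a placeholder, not a real dataset.

```python
import pandas as pd

# Load a CSV file into a DataFrame (the file name here is hypothetical).
df = pd.read_csv("housing.csv")

# A quick first look at the data.
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # column data types
print(df.head())   # first five rows
```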
Data wrangling is the process of converting data from its raw form to a tidy form ready for analysis. Data wrangling is an important step in data preprocessing and includes several processes like data importing, data cleaning, data structuring, string processing, HTML parsing, handling dates and times, handling missing data, and text mining.
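A rough sketch of a few typical wrangling steps with pandas is shown below; the file name and the column names (`sale_date`, `price`) are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical file

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Standardize column names (lowercase, underscores instead of spaces).
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Parse a date column and coerce a numeric column stored as text
# (both column names are assumed for the sake of the example).
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")
df["price"] = pd.to_numeric(df["price"], errors="coerce")
```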
Data visualization is one of the main tools used to analyze and study relationships between variables. Plots such as scatter plots, line graphs, bar plots, histograms, Q-Q plots, smooth densities, box plots, pair plots, and heat maps can be used for descriptive analytics. Data visualization is also used in machine learning for data preprocessing and analysis, feature selection, model building, model testing, and model evaluation.
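Here is a small sketch using matplotlib to produce a histogram and a scatter plot; the column names `price` and `size_sqft` are hypothetical and stand in for whatever numeric features your dataset contains.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("housing.csv")  # hypothetical file and columns

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of a single numeric feature.
axes[0].hist(df["price"].dropna(), bins=30)
axes[0].set_title("Distribution of price")

# Scatter plot to inspect the relationship between two variables.
axes[1].scatter(df["size_sqft"], df["price"], alpha=0.5)
axes[1].set_xlabel("size_sqft")
axes[1].set_ylabel("price")
axes[1].set_title("Price vs. size")

plt.tight_layout()
plt.show()
```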
An outlier is a data point that is very different from the rest of the dataset. Outliers are very common and are expected in large datasets. One common way to detect outliers in a dataset is by using a box plot. Outliers can significantly degrade the predictive power of a machine learning model.
Advanced methods for dealing with outliers include the RANSAC method.
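A simple way to flag outliers programmatically is the interquartile-range (IQR) rule, which is the same rule a box plot uses. The sketch below assumes a hypothetical `price` column.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("housing.csv")  # hypothetical file and column

# IQR rule: points beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["price"] < lower) | (df["price"] > upper)]
print(f"{len(outliers)} potential outliers outside [{lower:.0f}, {upper:.0f}]")

# A box plot gives the same picture visually.
df["price"].plot(kind="box")
plt.show()
```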
Most datasets contain missing values. Simply removing the affected samples or dropping entire feature columns is often not feasible, because we might lose too much valuable data. Instead, we can use interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we replace a missing value with the mean value of the entire feature column.
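A minimal sketch of mean imputation with scikit-learn, using a tiny made-up feature matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values (np.nan), for illustration only.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```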
Scaling your features helps improve the quality and predictive power of your model. Without scaling, the model can be biased toward features with larger numeric ranges. To bring features to the same scale, we can use either normalization or standardization.
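The sketch below shows both options with scikit-learn on a small made-up matrix: standardization rescales each feature to zero mean and unit variance, while min-max normalization rescales each feature to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: the second feature has a much larger range than the first.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean, unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): each feature rescaled to [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_norm)
```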
In machine learning, the dataset is often partitioned into training and testing sets. The model is trained on the training dataset and then tested on the testing dataset. The testing dataset thus acts as the unseen dataset, which can be used to estimate a generalization error (the error expected when the model is applied to a real-world dataset after the model has been deployed).
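A typical split looks like the following sketch, here using scikit-learn's built-in Iris dataset so the example is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the unseen testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```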
Artificial Intelligence is a term that is sometimes incorrectly used to refer to all Data Science problems. AI refers to computer systems or programs that are intelligent enough to complete tasks that generally require human intelligence.
A computer system in AI is known as an “intelligent agent”: it takes input from its environment and performs actions that maximize its chances of achieving its goals. Improving the learning, knowledge representation, perception, and object-manipulation capabilities of these “intelligent agents” is among the short-term goals of Artificial Intelligence.
Artificial Intelligence is further subdivided into Narrow AI & General AI. Narrow AI is only capable of doing a single task well such as understanding human speech, playing strategic games like Chess or Go, or operating vehicles autonomously. In contrast, a General AI should be capable of completing any task given to it, like a human. The long-term goal of AI is to create Artificial General Intelligence (AGI).
Considered a subset of Artificial Intelligence, Machine Learning is the application of statistical models and algorithms to solve problems by finding patterns and inferences.
Given the large amount of data being generated in modern times, the objective of ML is for machines to learn from the data itself, without human intervention or assistance. Data is the fundamental basis for all Machine Learning: with very little data, or no data at all, ML algorithms cannot derive useful inferences.
In Machine Learning, therefore, the algorithm learns to solve the problem by itself using data, in contrast to a traditional program, where a programmer has to explicitly write a set of instructions. Applications of Machine Learning include building recommender systems, clustering data points with similar characteristics together (for example, clustering customer data into market segments), and predicting future values based on past data (for instance, predicting stock or market prices).
Machine Learning approaches fall into three broad categories: Supervised, Unsupervised, and Reinforcement Learning.
a) Supervised Learning: These are machine learning algorithms that learn by studying the relationship between the feature variables and a known target variable. Supervised learning has two subcategories: classification, where the target is a discrete label (e.g., spam or not spam), and regression, where the target is a continuous value (e.g., a house price).
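A minimal supervised classification sketch with scikit-learn, fitting a model to labeled training data and scoring it on held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: features X and a known target y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a classifier on the labeled training data and score it on held-out data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```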
b) Unsupervised Learning: In unsupervised learning, we deal with unlabeled data or data of unknown structure. Using unsupervised learning techniques, we can explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.
K-means clustering is an example of an unsupervised learning algorithm.
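Here is a short k-means sketch on synthetic, unlabeled data generated with scikit-learn; the algorithm groups the points purely from their structure, without any target variable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: we only use the features X, never the labels.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Group the points into 3 clusters based on their structure alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)  # learned cluster centers
```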
c) Reinforcement Learning: Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its actions and experiences.
Reinforcement learning uses rewards and punishment as signals for positive and negative behavior.
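To make the trial-and-error idea concrete, here is a toy tabular Q-learning sketch on a made-up five-state corridor where the only reward is for reaching the rightmost state. This is a minimal illustration of learning from rewards, not a production RL setup.

```python
import numpy as np

# Toy 1-D world: states 0..4, reward only when reaching the rightmost state.
n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: mostly exploit the current estimates, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))

        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0

        # Q-learning update: observed reward plus discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)  # the learned values favor moving right in every state
```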
Deep Learning is the application of Artificial Neural Networks to imitate the workings of the human brain. It is considered a subset of Machine Learning because it uses data to learn features, inferences, and patterns automatically.
Unlike in traditional Machine Learning, where feature engineering is an important and difficult manual step, in Deep Learning it is taken care of automatically: “deep” layers of Artificial Neural Networks progressively extract higher-level features from the raw data.
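As a small taste, the sketch below trains a multi-layer perceptron (a shallow neural network, not a truly deep one) on scikit-learn's digits dataset; raw pixel values go in, and the hidden layers learn useful features on their own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small feed-forward neural network with two hidden layers.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```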
Cross-validation is a method of evaluating a machine learning model’s performance across different samples of the dataset. In k-fold cross-validation, the dataset is randomly partitioned into k folds; in each round, one fold serves as the testing set while the model is trained on the remaining k-1 folds. The process is repeated k times, once per fold, and the training and testing scores are then averaged over the k folds.
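With scikit-learn this is a one-liner, sketched here on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves as the testing set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(scores)                       # one score per fold
print(scores.mean(), scores.std())  # average and spread across folds
```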
A model with high bias and low variance makes strong assumptions about the form of the target function, while a model with high variance and low bias overlearns (overfits) the training dataset. The model should be tuned to balance the two and obtain the best fit, i.e., the model that performs best in production.
Principal Component Analysis (PCA) is a statistical method used for feature extraction, typically applied to high-dimensional and correlated data. The basic idea of PCA is to transform the original feature space into the space of the principal components, which are uncorrelated directions ordered by the amount of variance they capture.
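A brief PCA sketch with scikit-learn, projecting the four Iris features onto the first two principal components (standardizing first, since PCA is sensitive to scale):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto the first 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```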
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique commonly used for supervised classification problems. It models the differences between groups, i.e., it separates two or more classes. Like PCA, it projects the features from a higher-dimensional space into a lower-dimensional space, but unlike PCA it uses the class labels to do so.
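A matching LDA sketch with scikit-learn; note that, unlike PCA, the class labels y are passed to the fit.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels y to find directions that best separate the classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)  # (150, 2)
```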
a) Model Parameters: These are the parameters in the model that must be determined using the training data set. These are the fitted parameters.
For example, suppose we have a model such as:
cost = a + b × (age) + c × (size)
which estimates the cost of a house from its age and its size (in square feet). Then a, b, and c are our model (fitted) parameters.
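In code, fitting such a model determines a, b, and c from the training data. The sketch below uses a tiny set of made-up numbers purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data, for illustration only: [age in years, size in sq ft] -> cost.
X = np.array([[10, 1500], [5, 2000], [20, 1200], [1, 2500], [15, 1800]])
y = np.array([200_000, 310_000, 150_000, 400_000, 240_000])

model = LinearRegression()
model.fit(X, y)

# The intercept and coefficients play the roles of a, b, and c.
print("a (intercept):", model.intercept_)
print("b, c (coefficients):", model.coef_)
```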
b) Hyperparameters: These are adjustable parameters that must be tuned to obtain a model with optimal performance. Some examples of hyperparameters in machine learning: Learning Rate, Number of Epochs, Regularization constant, Number of branches in a decision tree, Number of clusters in a clustering algorithm (like k-means)
During training, it is important to tune the hyperparameters to obtain the model with the best performance (that is, with the best-fitted parameters).
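A common way to tune hyperparameters is a grid search with cross-validation, sketched here for the number of neighbors in a k-nearest-neighbors classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The number of neighbors is a hyperparameter: we choose it, the model does not learn it.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameter:", search.best_params_)
print("Best cross-validated score:", search.best_score_)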
In machine learning (predictive analytics), several metrics can be used for model evaluation. A supervised learning model with a discrete target, also referred to as a classification model, can be evaluated using metrics such as accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC).
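The sketch below computes several of these metrics with scikit-learn on hypothetical true labels and predictions for a binary classifier:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```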
a) Basic Calculus: Most machine learning models are built with a dataset having several features or predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine learning model. Here are the topics you need to be familiar with: Functions of several variables; Derivatives and gradients; Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function; Cost function; Plotting of functions; Minimum and Maximum values of a function
b) Basic Linear Algebra: Linear algebra is the most important math skill in machine learning. A data set is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, dimensionality reduction, and model evaluation. Here are the topics you need to be familiar with: Vectors; Norm of a vector; Matrices; Transpose of a matrix; The inverse of a matrix; The determinant of a matrix; Trace of a Matrix; Dot product; Eigenvalues; Eigenvectors
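Most of these linear algebra operations are one-liners in NumPy, as the short sketch below shows for a small 2×2 matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = np.array([1.0, 2.0])

print(np.linalg.norm(v))  # norm of a vector
print(A.T)                # transpose of a matrix
print(np.linalg.inv(A))   # inverse of a matrix
print(np.linalg.det(A))   # determinant of a matrix
print(np.trace(A))        # trace of a matrix
print(A @ v)              # matrix-vector (dot) product

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)
print(eigenvectors)
```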
c) Optimization Methods: Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data to obtain the predicted labels. Here are the topics you need to be familiar with: Cost function/ Objective function; Likelihood function; Error function; Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
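To tie these ideas together, here is a minimal gradient descent sketch that fits a simple linear regression (y ≈ w·x + b) by minimizing the mean squared error cost on synthetic data generated just for this example:

```python
import numpy as np

# Synthetic data for illustration: true relationship is y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)

w, b = 0.0, 0.0
learning_rate = 0.01

for _ in range(1000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE cost function with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction that decreases the cost.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # should end up close to the true values 3 and 2
```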
Statistics and Probability are used for the visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with: Mean, Median, Mode, Standard deviation/variance, Correlation coefficient, the covariance matrix, Probability distributions (Binomial, Poisson, Normal), p-value, Bayes’ Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve), Central Limit Theorem, R² score, Mean Squared Error (MSE), A/B Testing, Monte Carlo Simulation.
Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting. The commonly used regularization techniques are listed below; a short code sketch of the L1 and L2 penalties follows the list.
a) L1 regularization: LASSO (Least Absolute Shrinkage and Selection Operator) Regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).
b) L2 regularization: Ridge Regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).
c) Dropout regularization: Dropout is a regularization technique that aims to reduce the complexity of a neural network to prevent overfitting. Because randomly selected neurons are temporarily dropped during training, the network learns different, redundant representations; it cannot rely on any particular neurons, or combinations of them, always being present. Another nice side effect is that the training process will be faster.
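As promised above, here is a brief sketch of the L1 and L2 penalties with scikit-learn's Lasso and Ridge regressors on the built-in diabetes dataset (dropout is not shown, since it applies to neural networks rather than linear models). Note how the L1 penalty drives some coefficients to exactly zero, while the L2 penalty only shrinks them.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = load_diabetes(return_X_y=True)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty: some coefficients become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: coefficients are shrunk toward zero

print("OLS coefficients:  ", ols.coef_.round(1))
print("Lasso coefficients:", lasso.coef_.round(1))
print("Ridge coefficients:", ridge.coef_.round(1))
```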
We have defined some of the most common Data Science terms and hope these explanations have made the terminology clearer. There are many more terms, which we will discuss in future articles.