In a very high-dimensional space, supervised algorithms need to learn how to separate points and build a function approximation to make good decisions. When the features are very numerous, this search becomes very expensive in both time and compute. In some cases, it may be impossible to find a good solution fast enough.
This problem is known as the curse of dimensionality, and unsupervised learning is well suited to help manage it. With dimensionality reduction, we can find the most salient features in the original feature set, reduce the number of dimensions to a more manageable number while losing very little important information in the process, and then apply supervised algorithms to more efficiently search for a good function approximation.
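As a minimal illustration of this workflow, the sketch below reduces a high-dimensional dataset with PCA before fitting a classifier. The synthetic dataset, the component count, and the choice of scikit-learn are my assumptions for illustration, not something prescribed by the text.

```python
# Sketch: dimensionality reduction (PCA) before supervised learning.
# Dataset, n_components, and classifier choice are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# High-dimensional synthetic data: 200 features, only 10 informative.
X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the 10 most salient directions, then classify.
model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The pipeline searches for a good decision boundary in 10 dimensions rather than 200, which is the efficiency gain the paragraph above describes.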
Feature engineering is one of the most vital tasks data scientists perform. Without the right features, the machine learning algorithm will not be able to separate points in space well enough to make good decisions on never-before-seen examples. However, feature engineering is typically very labor-intensive; it requires humans to creatively hand-engineer the right types of features.
Instead, we can use representation learning from unsupervised learning algorithms to automatically learn the right types of feature representations to help solve the task at hand. The quality of data is also very important. If machine learning algorithms train on rare, distorting outliers, their generalization error will be higher than if they ignored or addressed the outliers separately.
With unsupervised learning, we can perform outlier detection using dimensionality reduction and create a solution specifically for the outliers and, separately, a solution for the normal data. Machine learning models also need to be aware of drift in the data. If the data the model is making predictions on differs statistically from the data the model trained on, the model may need to retrain on data that is more representative of the current data.
By building probability distributions using unsupervised learning, we can assess how different the current data is from the training set data—if the two are different enough, we can automatically trigger a retraining.
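As a concrete sketch of this idea, one simple way to compare current data against the training data is a two-sample Kolmogorov-Smirnov test per feature. The test choice, the threshold, and the synthetic data are my assumptions; other distribution-comparison methods would serve equally well.

```python
# Sketch: flag distribution drift between training-time data and current
# production data with a two-sample Kolmogorov-Smirnov test.
# The 0.01 threshold and the synthetic shift are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)    # feature at training time
current = rng.normal(loc=0.8, scale=1.0, size=5000)  # same feature, drifted

stat, p_value = ks_2samp(train, current)
drifted = p_value < 0.01  # tiny p-value: the two distributions differ
if drifted:
    print("drift detected; consider triggering a retraining")
```

In practice, this check would run per feature on a schedule, and the retraining trigger would fire when enough features drift.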
To frame where unsupervised learning fits within the machine learning ecosystem, it helps to first review supervised learning. In supervised learning, there are two major types of problems: classification and regression.
In classification, the AI must correctly classify items into one of two or more classes. If there are just two classes, the problem is called binary classification. If there are three or more classes, the problem is called multiclass classification. Classification problems are also known as discrete prediction problems because each class is a discrete group. Classification problems also may be referred to as qualitative or categorical problems.
In regression, the AI must predict a continuous variable rather than a discrete one. Regression problems also may be referred to as quantitative problems. Supervised machine learning algorithms span the gamut, from very simple to very complex, but they are all aimed at minimizing some cost function or error rate or maximizing a value function that is associated with the labels we have for the dataset. As mentioned before, what we care about most is how well the machine learning solution generalizes to never-before-seen cases. The choice of the supervised learning algorithm is very important in minimizing this generalization error.
To achieve the lowest possible generalization error, the complexity of the algorithmic model should match the complexity of the true function underlying the data.
We do not know what this true function really is. If we did, we would not need to use machine learning to create a model—we would just solve the function to find the right answer. But since we do not know what this true function is, we choose a machine learning algorithm to test hypotheses and find the model that best approximates this true function.
If what the algorithm models is less complex than the true function, we have underfit the data. In this case, we could improve the generalization error by choosing an algorithm that can model a more complex function. However, if the algorithm designs an overly complex model, we have overfit the training data and will have poor performance on never-before-seen cases, increasing our generalization error. In other words, choosing more complex algorithms over simpler ones is not always the right choice—sometimes simpler is better. Each algorithm comes with its set of strengths, weaknesses, and assumptions, and knowing what to use when given the data you have and the problem you are trying to solve is very important to mastering machine learning.
In the rest of this chapter, we will describe some of the most common supervised algorithms, including some real-world applications, before doing the same for unsupervised algorithms. The most basic supervised learning algorithms model a simple linear relationship between the input features and the output variable that we wish to predict. The simplest of all the algorithms is linear regression, which uses a model that assumes a linear relationship between the input variables x and the single output variable y.
If the true relationship between the inputs and the output is linear and the input variables are not highly correlated (a situation known as collinearity), linear regression may be an appropriate choice. If the true relationship is more complex or nonlinear, linear regression will underfit the data. Because it is so simple, interpreting the relationship modeled by the algorithm is also very straightforward. Interpretability is a very important consideration for applied machine learning because solutions need to be understood and enacted by both technical and nontechnical people in industry.
Without interpretability, the solutions become inscrutable black boxes. Linear regression is simple, interpretable, and hard to overfit because it cannot model overly complex relationships. It is an excellent choice when the underlying relationship between the input and output variables is linear.
Linear regression will underfit the data when the relationship between the input and output variables is nonlinear. Since the true underlying relationship between human weight and human height is approximately linear, linear regression works well for predicting weight using height as the input variable or, vice versa, for predicting height using weight as the input variable.
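A minimal sketch of this height-to-weight example follows. The slope, intercept, and noise level of the synthetic data are made up for illustration; the point is only that a linear model recovers a linear relationship well.

```python
# Sketch: predicting weight from height with linear regression.
# The synthetic relationship (slope 0.9, intercept -90, noise sd 5)
# is an illustrative assumption, not real anthropometric data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
height_cm = rng.uniform(150, 200, size=200).reshape(-1, 1)
# Assumed underlying linear relationship plus noise.
weight_kg = 0.9 * height_cm.ravel() - 90 + rng.normal(0, 5, size=200)

model = LinearRegression().fit(height_cm, weight_kg)
print(f"slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
print(f"predicted weight at 180 cm: {model.predict([[180.0]])[0]:.1f} kg")
```

The fitted slope lands near the assumed 0.9, which is exactly the behavior the text describes: when the true relationship is linear, this simplest of models is sufficient.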
The simplest classification algorithm is logistic regression, which is also a linear method, but the predictions are transformed using the logistic function. The outputs of this transformation are class probabilities: the probabilities that the instance belongs to each of the classes, where the probabilities for each instance sum to one.
Each instance is then assigned to the class for which it has the highest probability of belonging. Like linear regression, logistic regression is simple and interpretable. When the classes we are trying to predict are nonoverlapping and linearly separable, logistic regression is an excellent choice. When classes are mostly nonoverlapping (for example, the heights of young children versus the heights of adults), logistic regression will work well.
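A short sketch of the children-versus-adults example follows. The height distributions are invented for illustration; the sketch shows the class probabilities summing to one and an easy, mostly nonoverlapping problem being solved well.

```python
# Sketch: logistic regression on mostly nonoverlapping classes,
# e.g. heights of young children (class 0) vs adults (class 1).
# The height distributions are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
children = rng.normal(110, 8, size=100)   # heights in cm
adults = rng.normal(172, 9, size=100)
X = np.concatenate([children, adults]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression().fit(X, y)
# predict_proba returns per-class probabilities that sum to one.
proba = clf.predict_proba([[130.0]])[0]
print(f"P(child)={proba[0]:.2f}, P(adult)={proba[1]:.2f}")
print("training accuracy:", clf.score(X, y))
```

A 130 cm instance falls on the child side of the learned boundary, so it is assigned to the class with the higher probability, as described above.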
Another group of very simple algorithms are neighborhood-based methods. Neighborhood-based methods are lazy learners since they learn how to label new points based on the proximity of the new points to existing labeled points. Unlike linear regression or logistic regression, neighborhood-based models do not learn a set model to predict labels for new points; rather, these models predict labels for new points based purely on distance of new points to preexisting labeled points.
Lazy learning is also referred to as instance-based learning or nonparametric methods. The most common neighborhood-based method is k-nearest neighbors (KNN). To label each new point, KNN looks at the k nearest labeled points (where k is an integer) and has these already labeled neighbors vote on how to label the new point. By default, KNN uses Euclidean distance to measure what is closest. The choice of k is very important. If k is set to a very low value, KNN becomes very flexible, drawing highly nuanced boundaries and potentially overfitting the data. If k is set to a very high value, KNN becomes inflexible, drawing a too-rigid boundary and potentially underfitting the data.
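The effect of k can be seen in a small sketch. The dataset and the particular k values are arbitrary choices on my part; the pattern to notice is a perfectly flexible fit at k=1 versus a much more rigid one at very large k.

```python
# Sketch: the effect of k in k-nearest neighbors.
# Dataset (two interleaved half-moons) and k values are illustrative.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, 200):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Low k: very flexible boundary (risk of overfitting);
    # very high k: rigid boundary (risk of underfitting).
    print(f"k={k:3d}  train={clf.score(X_train, y_train):.2f}  "
          f"test={clf.score(X_test, y_test):.2f}")
```

At k=1, training accuracy is perfect (each point is its own nearest neighbor) while test accuracy suffers; a moderate k usually generalizes best.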
Unlike linear methods, KNN is highly flexible and adept at learning more complex, nonlinear relationships. Yet, KNN remains simple and interpretable. KNN does poorly when the number of observations and features grows. KNN becomes computationally inefficient in this highly populated, high-dimensional space since it needs to calculate distances from the new point to many nearby labeled points in order to predict labels.
It cannot rely on an efficient model with a reduced number of parameters to make the necessary prediction. Also, KNN is very sensitive to the choice of k. KNN is regularly used in recommender systems, such as those used to predict taste in movies (Netflix), music (Spotify), friends (Facebook), photos (Instagram), search (Google), and shopping (Amazon).
For example, KNN can help predict what a user will like given what similar users like (known as collaborative filtering) or what the user has liked in the past (known as content-based filtering).
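A toy user-based collaborative filtering sketch follows. The tiny ratings matrix, the cosine distance metric, and the "rating of 4 or more counts as a like" rule are all invented for illustration.

```python
# Sketch: user-based collaborative filtering with nearest neighbors.
# The ratings matrix, metric, and "like" threshold are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows: users, columns: movies; 0 means "not yet rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Find the user most similar to user 0 by cosine distance over ratings.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(ratings)
distances, indices = nn.kneighbors(ratings[:1])
neighbor = indices[0][1]  # indices[0][0] is user 0 itself
print(f"most similar user to user 0: user {neighbor}")

# Recommend movies the neighbor rated highly that user 0 has not rated.
unrated = np.where(ratings[0] == 0)[0]
recs = [m for m in unrated if ratings[neighbor, m] >= 4]
print(f"recommended movie indices for user 0: {recs}")
```

User 1 shares user 0's tastes, so movie 2 (which user 1 rated 5 and user 0 has not seen) is recommended: the neighbors "vote" on what the user will like.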
Instead of using a linear method, we can have the AI build a decision tree where all the instances are segmented or stratified into many regions, guided by the labels we have. Once this segmentation is complete, each region corresponds to a particular class of label for classification problems or a range of predicted values for regression problems.
This process is similar to having the AI build rules automatically with the explicit goal of making better decisions or predictions. The simplest tree-based method is a single decision tree, in which the AI goes once through the training data, creates rules for segmenting the data guided by the labels, and uses this tree to make predictions on the never-before-seen validation or test set.
However, a single decision tree is usually poor at generalizing what it has learned during training to never-before-seen cases because it usually overfits the training data during its one and only training iteration. To improve the single decision tree, we can introduce bootstrap aggregation (more commonly known as bagging), in which we take multiple random samples of instances from the training data, create a decision tree for each sample, and then predict the output for each instance by averaging the predictions of each of these trees.
By using randomization of samples and averaging results from multiple trees (an approach also known as an ensemble method), bagging will address some of the overfitting that results from a single decision tree. We can reduce overfitting further by sampling not only the instances but also the predictors.
With random forests, we take multiple random samples of instances from the training data, as we do in bagging, but, for each split in each decision tree, we make the split based not on all the predictors but rather on a random sample of the predictors.
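A brief sketch comparing the three approaches follows. The dataset and hyperparameters are illustrative assumptions; the point is that bagging averages many trees built on bootstrap samples, while a random forest additionally randomizes the predictors considered at each split.

```python
# Sketch: single decision tree vs bagging vs random forest.
# Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: bootstrap samples of instances, all predictors at each split.
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=100, random_state=0),
    # Random forest: bootstrap samples of instances, plus a random
    # subset of predictors considered at each split.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:13s} test accuracy: {model.score(X_test, y_test):.2f}")
```

On most datasets of this kind, the two ensembles generalize noticeably better than the single overfit tree, which is the motivation the text gives for moving from one tree to many.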