Machine learning is the secret sauce behind speech recognition, self-driving cars and other fascinating technologies. But what exactly is machine learning, and where can it be applied?
The only thing that computers (that is the "machine" part of machine learning) do is execute algorithms. An algorithm is a fixed sequence of steps. Computers aren't creative, nor can they come up with new ideas. How, then, can they learn anything at all?
For any given task there are countless conceivable algorithms, some better and some worse. If a computer is instructed to evaluate algorithms one by one and keep track of the best algorithm it has encountered so far, its performance in the task will increase over time as it stumbles upon better algorithms. It seems as if the computer is gradually learning to master the task. This is the core idea behind machine learning.
Let's consider a spam filter, which was an early success case for machine learning. A spam filter takes an email message as an input, processes it (somehow), and finally outputs a label: "spam" or "not spam". One could try to write a spam filtering algorithm by hand. Perhaps collect a list of spammy words ("casino", "free", "genuine", etc.), count how many of these words appear in an email message, and say that a message is spam if the count exceeds a certain threshold. However, there is no straightforward criterion for deciding which words should be on the list ("free" is likely to appear in a legitimate message, too) or for choosing the threshold value. Machine learning, on the other hand, can diligently evaluate millions of word list variants and pick the one that most accurately detects spam.
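The hand-written heuristic above can be sketched in a few lines of code. The word list and the threshold here are arbitrary choices on my part, meant only to illustrate the idea; they are exactly the parameters machine learning would tune for us.

```python
# A hand-written spam heuristic. The word list and the threshold are
# arbitrary illustrative choices, not a real filter.
SPAMMY_WORDS = {"casino", "free", "genuine"}

def is_spam(message, threshold=2):
    """Label a message as spam if it contains at least
    `threshold` words from the spammy word list."""
    words = message.lower().split()
    count = sum(1 for word in words if word in SPAMMY_WORDS)
    return count >= threshold

# "genuine", "free" and "casino" all appear, so the count (3)
# exceeds the threshold (2):
is_spam("Genuine offer: free spins at our casino")   # -> True
# Only "free" appears, so the count stays below the threshold:
is_spam("Are you free for lunch tomorrow?")          # -> False
```

Notice that both the word list and the threshold are baked in by hand, which is precisely the weakness the rest of this article addresses.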
So far we have seen that machine learning boils down to a selection of a well-performing algorithm. But how do we measure the performance? A fair way to compare algorithms is to evaluate them on a data set representative of the task the computer is supposed to learn. Typically, this training data is a set of input examples and corresponding human-annotated expected outputs. For example, a set of email messages that a person has read and annotated either as "spam" or "not spam" would work as training data for learning a spam filter.
Candidate algorithms are ranked based on how accurately they replicate the expected outputs when they are fed input data instances. Of course, ultimately we are interested in how well an algorithm will perform on inputs which are not part of the training data (after all, we already know the correct output for all the training instances). If the training data is representative and the evaluation is done carefully, there are guarantees that the performance generalizes to previously unseen data points.
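The ranking procedure above can be sketched as follows. The tiny annotated data set and the three candidate classifiers are invented for illustration; real candidate sets are vastly larger and are searched with optimization methods rather than enumerated.

```python
# A minimal sketch of model selection: rank candidate classifiers by
# how accurately they reproduce human-annotated labels.
def accuracy(classifier, training_data):
    """Fraction of (message, label) pairs the classifier gets right."""
    correct = sum(1 for message, label in training_data
                  if classifier(message) == label)
    return correct / len(training_data)

# Human-annotated training data (invented examples).
training_data = [
    ("win free casino bonus", "spam"),
    ("meeting moved to friday", "not spam"),
    ("free genuine offer inside", "spam"),
    ("lunch tomorrow?", "not spam"),
]

# Three competing candidate algorithms.
candidates = [
    lambda m: "spam" if "free" in m else "not spam",
    lambda m: "spam" if "casino" in m else "not spam",
    lambda m: "spam",  # labels everything as spam
]

# Keep the candidate that scores best on the training data.
best = max(candidates, key=lambda c: accuracy(c, training_data))
```

On this toy data the first candidate labels every message correctly, so it wins the comparison.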
Where do the candidate algorithms come from? A data scientist, who is setting up the learning task, has to decide which algorithms the computer should consider.
The choice of which algorithms to evaluate is affected by two, often contradictory, considerations: the set of candidate algorithms should be flexible enough to capture essential features of the task (such as information about which variables depend on each other), and, at the same time, the search over the candidates should be mathematically tractable. Often, the candidate set is selected so that a comparison between algorithms reduces to a mathematical optimization problem. Optimization is a heavily studied field, and plenty of tools exists for solving optimization problems effectively.
The image below visualizes the task of learning a simple spam filter. It depicts emails that are collected for training purposes and three instances of competing algorithms. The training emails are shown as dots. The solid dots are spam and the open dots are non-spam messages. In this example, emails are described by just two variables that are assumed to be related to their degree of spamminess: the number of words in an email (vertical axis) and the number of exclamation marks in an email (horizontal).
I have decided to restrict the search space of algorithms to lines. A line defines a spam filter: emails that fall on one side of the line are classified as spam and the ones on the other side as non-spam. The three pictures show three examples out of infinitely many possible ways of drawing a line on the plane.
Machine learning now turns into an optimization problem: choose a line so that as many of the solid dots as possible fall on the one side of the line and as many open dots as possible on the other side. Obviously, the rightmost picture is the optimal solution, and hence the most accurate spam filter, because it cleanly separates the two types of dots.
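The line-picking optimization can be sketched in code. A line is represented by coefficients (a, b, c), and a point (x, y) is classified by the sign of a·x + b·y + c. The data points and the three candidate lines below are made up for illustration, standing in for the three pictures above.

```python
# A sketch of searching over lines: each candidate line (a, b, c)
# classifies a point (x, y) by the sign of a*x + b*y + c.
def classify(line, point):
    a, b, c = line
    x, y = point
    return "spam" if a * x + b * y + c > 0 else "not spam"

def accuracy(line, data):
    """Fraction of labeled points the line classifies correctly."""
    return sum(classify(line, point) == label
               for point, label in data) / len(data)

# (exclamation marks, word count) -> label; invented training points.
data = [
    ((5, 20), "spam"), ((7, 15), "spam"),
    ((0, 120), "not spam"), ((1, 200), "not spam"),
]

# Three candidate lines, playing the role of the three pictures.
candidates = [(1, 0, -3), (0, 1, -50), (1, -0.05, 0)]
best = max(candidates, key=lambda line: accuracy(line, data))
```

The first candidate line ("more than three exclamation marks means spam") separates this toy data perfectly, so the search selects it.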
Of course, beyond this simplified example, there are usually more than two variables, and the data might not separate so cleanly into two regions. These introduce some additional challenges, but the basic principles of learning stay the same.
Some tasks are more suited for machine learning than others. Machine learning depends on the availability of training data. Therefore, tasks with many easily available training instances are particularly suitable for machine learning. In cases like natural language translation, appropriate data is readily available because a huge number of texts have already been translated by human translators.
Machine learning can conquer some tasks that are complicated for humans because machines are good at searching through huge spaces of algorithms. Computers learn to beat humans at games like Go by systematically exploring games' vast state spaces.
An interesting class of tasks are those that humans can carry out instinctively but attempting to program a computer to do the same is hard because formulating the thought process as an explicit algorithm is next to impossible. Examples include tasks such as driving a car or coloring black-and-white images. With machine learning, computers can learn to perform these tasks.
Computers don't learn by themselves. A human data scientist must express the learning goals in a format a computer understands. It can be tricky to formulate a learning task so that the computer is able to learn efficiently. Usually, a lot of time is spent on preprocessing the raw data so that a learning algorithm can take advantage of it. Trying to use machine learning for everything is not feasible; humans are still best at tasks that require real ingenuity and flexibility.
It is very easy to get fooled by spurious correlations. If a spam filter is trained on a small data set, it may happen by chance that some common English word, such as "you", occurs in spam messages but not in any of the non-spam messages. A naive spam filter would learn to associate the word "you" with spam and therefore make foolish classifications. Guarding against such spurious patterns demands rigorous controls on the learning process.
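The "you" pitfall can be demonstrated concretely. In the invented four-message training set below, the word "you" happens by chance to occur only in the spam messages, so a naive filter latches onto it and then mislabels a perfectly ordinary new message.

```python
# A toy illustration of a spurious correlation: in this tiny invented
# training set the common word "you" occurs only in spam by chance.
train = [
    ("you won a free casino prize", "spam"),
    ("did you claim your bonus", "spam"),
    ("quarterly report attached", "not spam"),
    ("meeting moved to friday", "not spam"),
]

def naive_filter(message):
    # Associates "you" with spam purely because of the tiny sample.
    return "spam" if "you" in message.split() else "not spam"

# The filter fits the training set perfectly...
all(naive_filter(m) == label for m, label in train)   # -> True
# ...but makes a foolish call on a harmless new message:
naive_filter("can you review my draft?")              # -> "spam"
```

A larger training set, or a held-out evaluation set, would quickly expose that "you" carries no real signal.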
Instead of trying to formulate an algorithm for a particular task by hand, machine learners climb higher on the ladder of abstractions and introduce a new trick: they let the computer search for a well-performing algorithm. This approach can be taken if there is suitable data available, and if the search for a good algorithm can be formulated in a mathematically tractable way. When these conditions are met, machine learning enables us to find solutions to many difficult tasks where developers struggle to write algorithms directly.