Some mathematical preliminaries
In the field of Machine Learning, there are two[1] predominant strategies to train a machine learning algorithm: supervised learning and unsupervised learning. Common to both strategies is the idea of an example (e.g., an email, an image, a video snippet, etc.) and the translation of these examples into features (e.g., the sender, receiver, and subject of an email). Exactly how this translation works is outside the scope of this article, and I leave it as an exercise for the reader.
From a high-level perspective, in supervised machine learning, the algorithm is provided with a set of examples such that each example has a label (e.g., whether an email is spam) and the algorithm attempts to “figure out”[2] how to map these examples to their corresponding labels. Conversely, in unsupervised learning, an algorithm is given only the examples and attempts to find patterns and similarities among them.
How about a “real” scenario?
But let’s step back and compare these two strategies using a concrete scenario: sorting a large bin of mixed produce at a grocery store. For simplicity, we will say that both algorithms will use the same set of features to sort the produce: size/shape, weight, colour, skin texture/pattern, and the presence of a stem. While these features might not be exhaustive, they provide enough flavour for our purposes.
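To make this slightly more concrete, here is one hypothetical way a single piece of produce could be written down as features. This is only an illustrative sketch in Python; the feature names, values, and scales are invented for this example and are not taken from any real system.

```python
# One hypothetical produce example encoded as features.
# The names and values below are invented for illustration only.
granny_smith = {
    "size_cm": 8.0,            # approximate diameter (size/shape)
    "weight_g": 180.0,         # weight
    "colour": "green",         # colour (often converted to numbers in practice)
    "skin_texture": "smooth",  # skin texture/pattern
    "has_stem": True,          # presence of a stem
}
```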
Sorting Produce using Supervised Learning
In our scenario, we must first provide the supervised machine learning algorithm with examples of each produce type we wish to identify and label. Accordingly, we might take some of the fruit out of the bin, label it ourselves, and show these labelled examples to the algorithm. The algorithm will internally learn to weigh each feature (and its values) differently for each possible label (e.g., an orange fruit is unlikely to represent an apple). Once the algorithm has determined how to set these feature weights, we may begin showing new fruit from the bin to the algorithm and receiving labels from it. The selected label is often the one whose feature weights, combined with the features of the new example, maximize some score[3]. Accordingly, the label that the algorithm provides is an approximation based upon what it was trained on and how closely those training examples resemble the new one (e.g., if the algorithm has never been shown a grapefruit, it may label one as an orange).
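As a rough illustration of the idea (and not of any particular product’s implementation), the sketch below trains a simple classifier on a handful of made-up, hand-labelled produce examples and then asks it to label a new one. The tiny dataset, the feature scales, and the choice of scikit-learn’s LogisticRegression are all assumptions made for the example.

```python
# A minimal supervised-learning sketch with made-up produce data.
from sklearn.linear_model import LogisticRegression

# Each row: [size_cm, weight_hg, colour_score, texture_score, has_stem]
X_train = [
    [8.0, 1.8, 0.2, 0.1, 1],  # apple
    [7.5, 1.7, 0.3, 0.2, 1],  # apple
    [7.0, 1.4, 0.9, 0.8, 0],  # orange
    [7.2, 1.5, 0.8, 0.9, 0],  # orange
]
y_train = ["apple", "apple", "orange", "orange"]

clf = LogisticRegression()
clf.fit(X_train, y_train)  # learn how to weigh each feature from the labelled examples

new_fruit = [[7.1, 1.45, 0.85, 0.85, 0]]
print(clf.predict(new_fruit))        # the label with the highest score
print(clf.predict_proba(new_fruit))  # the per-label scores (probabilities here)
```

If we fed a grapefruit into this toy model, it would still have to pick between “apple” and “orange”, which is exactly the kind of approximation described above.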
Sorting Produce using Unsupervised Learning
With the unsupervised algorithm, we have a much easier job. We simply feed all of the examples (i.e., produce) into the algorithm and it sorts them into different bins based upon the similarity of their features. The number of bins may be predefined (e.g., if we know that there will be 50 types of produce) or determined as needed (e.g., maybe a shipment only contained pears and oranges). The key difference is that we don’t know what produce went into which bin. We must examine each bin and provide some understandable label for the bin as a whole. Thus, the algorithm didn’t label any produce; it just put like with like[4]. This means that there is additional work to be done after putting the produce through the algorithm to figure out what actually happened. This could be easy if the produce is vastly different, but harder the more similar the produce is (e.g., separating navel oranges and grapefruit).
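Here is a comparable sketch for the unsupervised case, again with made-up data and using scikit-learn’s k-means as one possible clustering algorithm. Note that the algorithm returns bin indices rather than produce names, so the naming step is still left to us.

```python
# A minimal unsupervised-learning sketch with made-up, unlabelled produce data.
from sklearn.cluster import KMeans

# Each row: [size_cm, weight_hg, colour_score, texture_score, has_stem]
X = [
    [8.0, 1.8, 0.2, 0.1, 1],
    [7.5, 1.7, 0.3, 0.2, 1],
    [7.0, 1.4, 0.9, 0.8, 0],
    [7.2, 1.5, 0.8, 0.9, 0],
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g., [0, 0, 1, 1] -- bin numbers, not produce names
# Someone still has to look inside each bin and decide what to call it.
```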
But what about errors?
Regardless of which algorithm we use, mistakes will invariably be made, whether that is calling an Asian pear an apple or labelling a cucumber as a zucchini. Both algorithms are capable of making similar types of errors. While we can tune either algorithm to do better at this task, part of this will depend on how good (more precisely, how discriminative) the features we choose are. However, that’s a topic for another blog post.
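One common way to see these mistakes is a confusion matrix over a small set of produce whose true labels we know. The predictions below are invented purely for illustration; the off-diagonal counts are the errors.

```python
# A hypothetical error count for a handful of produce with known labels.
from sklearn.metrics import confusion_matrix

y_true = ["apple", "apple", "cucumber", "zucchini", "orange"]
y_pred = ["apple", "pear",  "zucchini", "zucchini", "orange"]

labels = ["apple", "pear", "orange", "cucumber", "zucchini"]
print(confusion_matrix(y_true, y_pred, labels=labels))
# Off-diagonal entries are mistakes, e.g., the cucumber labelled as a zucchini.
```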
So what do we pick?
Ultimately, when we choose between a supervised or unsupervised method, we seek to balance a set of trade-offs. Supervised methods require an up-front cost to find and label examples to learn from, but unsupervised methods do not. On the other hand, unsupervised methods carry their cost at the tail end: bins may have to be manually labelled or described based upon manual inspection. This process can be long and tedious, especially when we have many resulting groups of similar examples. The other trade-off is in extensibility and reusability: the up-front cost of a supervised method is mitigated when we can reuse it on the next project. In our running scenario, the unsupervised approach would likely yield different bins on a different shipment of produce and require new manual work, while the supervised method may be run with little to no refinement (e.g., if we know there are no new types of produce).
Ignoring for a moment the effort involved in either approach, we might more generally say that supervised machine learning algorithms are best used when we know a priori what (most of) our labels will be. Unsupervised methods are then best suited when we have little to no prior knowledge of what those labels may be. Which method is best is often a matter of one’s effort budget and the amount of prior knowledge one has about the task and data.
What does Kira do?
Kira’s Quick Study feature utilizes a supervised machine learning approach to help our users extract information from their contracts and other documents. We chose a supervised approach because it allows users to have more control of the process end-to-end, with less effort and higher accuracy than we found with other solutions. In fact, our core Quick Study technology is both patented and has gone through academic peer review, a process we believe helps ensure that we produce quality results for our users.
[1] There are technically three (and several flavours in between), but this article is ignoring the third for simplicity.
[2] In mathematics, we would say that the algorithm attempts to induce a function that maps examples to labels.
[3] What this score represents depends on the algorithm used. For example, it could represent the probability of the produce having the selected label.
[4] More accurately, it grouped together the produce that it statistically judged to be most similar.