An Illustration of Software 2.0
In 2017, Andrej Karpathy wrote a seminal blog post where he called neural networks “software 2.0”. He makes the case that neural networks constitute a paradigm shift in software development. The innovation consists in substituting imperative programming with “programming by example”. Instead of writing in detail the operations a program should go through to process some input and produce some output, you just specify a goal for the program and show it many examples. Watching the examples, the program will then “write itself”. In that sense, neural nets constitute “meta-software” — software that produces software.
In his post, Karpathy then goes on to speculate about the consequences of this — an invaluable read. Over the years, I’ve found myself making Karpathy’s point to various people, but I wasn’t sure it really landed with them. One problem with communicating this is that the types of problems we tend to solve with neural nets (computer vision, speech synthesis, natural language understanding, etc.), we were never really close to solving with imperative programming. Many people therefore don’t really see that software 1.0 and software 2.0 are substitutes — or fundamentally different ways of doing the same thing: writing software. It feels like they’re just different things solving non-overlapping problems. And this is mostly true. But, in the future, I think the overlaps are going to grow more than we think.
Recently, I stumbled upon a simple problem that I thought would illustrate well not only the differences between software 1.0 and software 2.0 but also how they compete with each other in solving problems. I will also use that to illustrate the economics guiding the software 1.0 v software 2.0 choice, and what are the tradeoffs involved in that choice.
A friend was preparing for a technical interview on Leetcode and he shared with me one of the problems he had worked on: converting integers into Roman numerals. At present, no one in their right mind would use a neural net to solve this problem (it likely takes longer to code, is slower to run, and consumes more memory and compute). But, in the future, the software 2.0 tooling might be so effective, chips so optimized, and memory and compute so cheap, that it might actually make sense to solve this problem with a neural net. The main reason is that software 2.0 substitutes cheap labor for expensive labor.
Assuming neural nets become off-the-shelf commodities, the bulk of the work involved in programming by example is… collecting examples. As Karpathy puts it, “most of the active ‘software development’ takes the form of curating, growing, massaging and cleaning labeled datasets.” This is arguably more accessible than writing imperative code. Imperative coding essentially requires humans to think like robots. Any sloppiness in reasoning and you get a memory leakage, an infinite loop, or some other unintended consequence. It’s merciless, and it takes years to rewire your brain to avoid all the common pitfalls. Curating datasets, in contrast, is much more human-friendly, and therefore the labor pool is larger.
The second big advantage of programming by example is that the same neural net architecture can be used to solve different problems. The labor that went into building the neural net can therefore be amortized over several tasks. Each different task still requires one to collect different data (“software 2.0 development”), but as we mentioned, this may be much cheaper than writing imperative coding. Software 1.0 is much more “brittle” in that sense. The imperative code to convert integers into Roman numerals is completely different from the imperative code to convert Roman numerals into integers. In contrast, you can repurpose the exact same neural net to solve this new problem. There currently are limits to the versatility of neural nets though. Typically, there are “problem spaces” within which a given architecture can be repurposed. For instance, a recurrent neural net can be somewhat seamlessly repurposed for any problem involving converting a sequence of symbols into another sequence of symbols (this is the case I will illustrate in this post). Similarly, without changes to its architecture, the same convolutional neural net can be repurposed for various image classification tasks — whether it is detecting cats in pictures, or evidence of cancer in X-rays. You just have to feed it the right examples (this is not totally true: while the neural net architecture remains the same, you may also need to tweak its hyperparameters). While the versatility of neural nets remains contained within separate problem spaces, the main goal of the machine learning community as a whole is to expand each of those islands and eventually merge them. And in fact, recent developments hint at such a convergence: for instance, the “transformer” (a kind of neural net architecture) recently crossed over from the natural language processing space into the image processing space — a unification of sorts.
There are downsides to programming by example though. Despite future optimizations, it might remain on average less efficient than imperative coding. More importantly, a piece of software that wrote itself via neural nets is fundamentally probabilistic: you cannot 100% guarantee that it will always produce the correct output. In theory, you can offer that guarantee with software 1.0, although people rarely go through the trouble of proving correctness of an imperative codebase. That said, those two downsides pale in comparison to the above advantages (cheap development and versatility).
Before we dive into our little example to illustrate all of the above, let me quickly reiterate the point I’m trying to make here: imperative programming and programming by example, while they currently may seem to address non-overlapping problems, might be more directly competing over problems in the future. Using neural nets to convert integers into Roman numerals (or vice versa) might seem silly, but I think it makes the point crystal clear.
Converting integers into Roman numerals
The description of the problem is fairly straightforward. I’ll just show a few example conversions:
4 -> IV
1193 -> MCXCIII
548 -> DXLVIII
3616 -> MMMDCXVI
21 -> XXI
Most people are familiar with Roman numerals. Unlike Arabic numerals, the value of a Roman numerical symbol does not depend on its position within a number. The symbol
300 has a different “value” than in
30. In the Roman system,
5— regardless of its position in the number. Roman symbols may also be locally subtracted as you parse them in order to figure out the final number. For instance,
XIV = X + (V — I) = 14. Arabic symbols are just added:
14 = 10 + 4.
Generally speaking, Arabic numerals are more compact, easier to compare, and way more convenient for doing arithmetic. Finally, it scales way better to represent large numbers. The Roman system would have to just keep adding symbols, making the whole thing more and more unwieldy and unparseable (thankfully, Romans didn’t seem to need to handle numbers larger than 3999).
Let’s write the imperative code to convert integers into Roman numerals. Later, we’ll contrast this with the “machine learning way” of doing this. For both approaches, I’ll use the Python programming language.
The imperative way
Let’s first define a mapping between Roman numerals and their corresponding integers, and then reverse the mapping.
Now onto the function itself. The main trick is to know how to repeatedly extract the most significant digit of the integer. To do that, you need to divide it by the nearest power of 10 that is still lower than that integer. The quotient of that division will be the most significant digit. Then you map it to its Roman equivalent with some simple rules. You then move on to the next most significant digit. In practice, you replace the original integer with the remainder of the above division and compute the most significant digit of that new integer (don’t forget to update the divisor to be an order of magnitude lower). Honestly, I don’t think this problem is straightforward for most people, which is probably why Leetcode considers it medium in difficulty (when I checked, only 60% of the submissions on the website were correct).
Let’s make a quick test set for sanity check purposes:
Programming by example
As discussed, programming by example requires collecting examples. So let’s assume we paid someone to do just that. For instance, that person asked 100 different people to come up with 20 examples. Of those 2000 examples, half will be used for the neural net to “write itself”. The other half will be used to check how well it’s doing. If there is room for improvement, typically it’s because showing the various examples once is not enough. Sometimes you have to go through the same examples a few times for the neural net to learn (we call these repetitions “epochs”). Let’s further keep 100 examples completely aside. At the end, when we’re confident the neural net has learned to solve the problem, we use the 100 examples held out to get an estimate of its accuracy. Remember: we mentioned that a downside of software 2.0 is that it’s probabilistic. The test set allows us to estimate how reliable the software is.
To code up the neural net, I’ll use PyTorch — a popular Python framework for deep learning. The first thing to do is to write classes to process the data and serve it to the neural net in the appropriate format. The
NumberDataset class reads the examples from a file. Each line in the file consists of example pairs:
$ head data/train.txt
We are going to consider both the Arabic and Roman numerals as arbitrary symbols, which together form a common vocabulary. Those symbols will therefore be treated as characters. For instance,
794 is a string of the following symbols:
VIII is a string of the following symbols:
I. All these symbols (
V) are part of our vocabulary. However, neural nets work with numbers, not strings. So you can’t just serve the net with strings of characters. This is where the
Processor class comes in: it maps the symbols in our vocabulary to unique integers. And that’s what we feed the neural net. One last thing: neural nets also typically expect their inputs to come in batches. Since, at the end of the day, neural nets are just a series of matrix multiplications with some nonlinearities peppered here and there, processing is made more efficient by using batches of numbers. That’s the role of the
collate function, which forms nice “boxes” of integers to feed the network.
Let’s now build our vocabulary, and then our datasets. PyTorch’s
DataLoader class just takes a dataset and serves batches from it.
Now on to the meaty stuff. We’ll use a sequence-to-sequence network directly inspired by the classic 2015 Bahdanau et al. paper. The architecture has three components: (1) an encoder, (2) a decoder, and (3) an attention mechanism in between. I won’t go into the details of these models, as the Internet is already replete with detailed explanations. Essentially, an encoder converts each symbol in a sequence into a multidimensional vector. A vector is simply a way to describe some entity. The more dimensions this vector has, the more granular the description can be. For instance the vector
["6'1", "brown", "75kg"] is a very rough description of myself that only encompasses the three dimensions of height, hair color, and weight. In this case, the “entity” is a person. In the case of our little problem, the “entity” is a numerical symbol. In both cases, the entity can be represented by a vector of numbers. Vectors are modular: if one of the descriptions in your vector is a non-number entity, it can itself be represented as a vector, and you just expand the original vector. For instance, “brown” is not a number, but “brown” can be represented as a vector of three numbers (intensities of red, green and blue), so that the final vector representing me is
["6'1", "red intensity of brown", "green intensity of brown", "blue intensity of brown", "75kg"]. Any non-number entity can eventually be represented by a vector of numbers. Were you to expand the vector, you could cram more information into it.
Anyways, the encoder simply converts a string of symbols into an array of vectors that capture information about those symbols, just like
["6'1", "brown", "75kg"] captures information about myself. The attention module then “summarizes” the information across those input symbols into a single vector to feed into the decoder. As the name implies, the decoder does the reverse process: it takes a vector as input and produces a symbol. In our case, it will produce a sequence of Roman numerical symbols. To get the decoder started, we need to feed it a special vector whose meaning is “start of the sequence”. If you look back at the code we use to build our vocabulary, that’s the symbol
<bos>. But on top of that seed vector, we need to influence the decoder with some information from the encoded input. The decoder needs to “look” at the encoded input to guide its production of symbols. Every time it spits out a new symbol, the decoder needs to pay attention to the encoded input. Say the input is
XIV and the decoder has already produced the symbol
1. Upon producing the next symbol, it probably needs to pay most attention to the second and third vectors in the encoded input (vectors representing
V). That’s because the decoder needs to subtract one from the other. This process of “paying attention to different parts of the input” is what’s known as… an attention mechanism. That’s the last component of our architecture. In practice, the decoder needs to “summarize” the information it pays attention to into a single vector. The way it does that is by doing a weighted sum of the encoded input vectors. In the case where the decoder is trying to produce
4 in the sequence
14, it will likely learn to give a low weight to
X in the input and a high weight to
Let’s now write some functions to train and evaluate our model. It’s useful to distinguish here between the loss function (also known as “criterion”) and the metric function. While they both give us a sense of performance, the former is used by the neural net to guide its “self-writing” process (technically known as “gradient descent”) and the latter is used by the human supervisor to evaluate how well the self-writing software is doing, like a foreman overseeing a worker in the factories of old. The criterion is typically more amenable to the gradient descent process but not easily interpretable by the foreman. The metric function, in turn, is more intuitive. In our case, the criterion is cross-entropy and the metric is accuracy (how many sequences does the software get right).
fit function encapsulates both the training process and the evaluation process. It repeats both processes until the neural net seems to have converged to a decent level of performance, which the foreman (us) can observe via the metric. The number of epochs just captures how many times the software goes through the same examples until it can’t get any better. The
patience parameter is used to decide when to stop training — essentially it causes the software to stop training when it stops making any progress for a few epochs. When the software is done writing itself, the
fit function spits out the final software.
Let’s instantiate our model. The
opt object is the thing that will run the gradient descent process for us. In other words, it is what manages the neural net’s “self-writing” process. This is also where we pick several so-called “hyperparameters” (
lr). Without going into the details, they determine either some structural elements of the neural net’s architecture (like the size of the vectors representing the symbols) or elements of the training process (like the rate at which the software updates itself based on additional examples).
Time to fit! (Only displaying the criterion and metric values at the last epoch.)
The validation metric is not great (82%). For a better sense of the confidence we should put in this piece of probabilistic software, let’s run it on our held-out test set of 100 examples unseen during training.
Same as the validation metric. Again, not the greatest. There are various reasons for that. First of all, we only used 1000 examples, which in the realm of deep learning is extremely small. If we collect additional examples, we will reach close to 100% pretty quickly. The right level of performance depends on the use case. The various cost savings of software 2.0 might be worth the accuracy hit. Of course, with this silly use case, a ~20% accuracy hit does not make any sense. (And again, the whole idea of using neural nets for this use case does not make much sense, this is just for illustration purposes.)
Let’s try a few examples to see the software in action. For that we build a
predict function that wraps the software with some light code needed to serve the input and to format the output. We show the results with our little test set. In parenthesis is the right answer.
4 -> I (IV)
1193 -> MCXCIII (MCXCIII)
548 -> DXLVIII (DXLVIII)
3616 -> MMMDCXVI (MMMDCXVI)
21 -> CXX (XXI)
We get 3 out of 5 right on our little test set.
Converting Roman numerals into integers
I mentioned that versatility is another major benefit of software 2.0. Let’s illustrate that by using the exact same neural net architecture to do a different but related task: converting Roman numerals into integers (instead of the other way around). To drive the point home, I first write the equivalent imperative program. Notice how completely different it is from the imperative program that converts integers into Roman numerals.
This was easier for me to code, but still took me some time.
Now let’s turn back to our neural net. This is where software 2.0 really shines. All I have to do to repurpose my neural net is to feed it different data. I don’t need to code anything. The only change I make is to add
target='integer' when reading the data. This ensures that the examples are served in such a way that the neural net learns to produce integers from Roman numerals.
IV -> 4 (4)
MCXCIII -> 1193 (1193)
DXLVIII -> 548 (548)
MMMDCXVI -> 3616 (3616)
XXI -> 21 (21)
The reverse program does better (88% accuracy) and gets all our 5 sanity checks right. However, 88% is far from perfect. Maybe things would be different with more examples. Indeed, if we add 4000 more training examples, we see improvements in both programs. The integer-to-roman-numerals program achieves 96% accuracy on the held-out test set, and the roman-numerals-to-integer program achieves a perfect score.
In this post, I’ve described two software paradigms: imperative programming (software 1.0) and programming by example (software 2.0). I’ve argued that, because they currently tend to focus on non-overlapping problems, they are not easily seen as substitutes of each other. I used the (silly) example of converting Roman numerals into integers and vice-versa to illustrate that substitutability. In the process, I’ve highlighted the pros and cons of each paradigm. Because software 2.0 is “self-writing”, the skills to “develop” that software (curate datasets of examples) are cheaper to source. Of course, the neural nets themselves have to be coded imperatively, but the argument is that they are destined to become commodities. And because they are more easily multi-purpose, the fixed cost of coding them is amortized over multiple problems. In the above example, this was illustrated by repurposing the same neural net to solve a task that, in the software 1.0 paradigm, requires a complete rewrite of the software. In the future, it’s likely that that it’ll be cheaper to do almost anything the software 2.0 way. This is because the cost of its relative inefficiency (more memory consumption, more compute, etc.) will become negligible compared to the savings in auto-writing that software and in substituting data curators for software developers.
You can find the notebook I used for this post here: https://github.com/cwarny/software-2.0