There’s a widespread misconception about what exactly AI is and does. People use simplified language to describe what’s happening, but this leads to impressions about stealing or deriving works that don’t really track with what’s going on under the hood. That’s not to say all concerns vanish once we clarify what’s really happening, but only with an understanding of the nuance can we hope to tackle this complicated issue.
First, let’s talk about “functions”. In computer programming, a function is a block of code that you (and everyone else) can re-use whenever you want, rather than re-writing it every time. For example:
ComputeAverage(num1, num2) -> result
As a programmer, you can “call” this function and give it two numbers, and it will return the resulting average. You don’t care how it did it; it’s all abstracted behind the function. Now, as it happens, this function is easy to write:
ComputeAverage(num1, num2) {
result = ( num1 + num2 ) / 2
return result;
}
But the point is, we don’t need to know how it’s written to use it. As long as someone somewhere wrote it, we can use it. Other functions can be simple, such as:
CapitalizeName( name ) -> capitalized_name
CropPicture( picture, new_height, new_width ) -> cropped_picture
Or they can be very complicated:
FindNextChessMove( board_state, turn ) -> next_move
DoesPictureContainCat( picture) -> yes_or_no
As it happens, there is no programmer today who knows how to write a good version of DoesPictureContainCat, so keep that in mind as we’ll come back to it. In theory, though, such a function could be written, and if it were, we could call it like any other and it would do its job.
Now enter neural networks. It turns out, for theory-of-computation reasons, that many functions can be approximated by a neural network, some perfectly, most less so. For example, this network with two weights and one bias will also compute the average of two numbers, even though it looks totally different from the ComputeAverage example above:
import torch
import torch.nn as nn
layer = nn.Linear(in_features=2, out_features=1)
with torch.no_grad():  # set the three parameters by hand instead of training them
    layer.weight.copy_(torch.tensor([[0.5, 0.5]]))
    layer.bias.copy_(torch.tensor([0.0]))
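Call it like any other function and it behaves just like ComputeAverage (a quick sanity check, assuming the PyTorch snippet above):
print(layer(torch.tensor([[3.0, 5.0]])))  # prints a tensor containing 4.0, the average of 3 and 5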
There are two aspects to the above network. First, its architecture/size: this is a linear network with 3 parameters. Second, the values of the parameters, in this case 0.5, 0.5, and 0. What’s interesting is that the exact same network, if initialized with different parameters, will do something else entirely. For most values what it does is meaningless or useless, but for some it’s approximating (perfectly, in this case) a totally different function:
layer = nn.Linear(in_features=2, out_features=1)
with torch.no_grad():  # same architecture, different parameter values
    layer.weight.copy_(torch.tensor([[1.8, 0.0]]))
    layer.bias.copy_(torch.tensor([32.0]))
Same network shape BUT by using different parameters (notice the 0.5 and 0.5 changed to 1.8 and 0, and then the 0.0 became 32), we’ve just created this function:
ConvertCelsiusToFahrenheit( c_temp ) -> f_temp
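Same sanity check as before (a hedged sketch; the second input is simply ignored because its weight is 0):
print(layer(torch.tensor([[100.0, 0.0]])))  # prints a tensor containing 212.0, i.e. 100 °C in Fahrenheit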
This is a critical point to understand. Given a model that has 3 parameters, there exists a “parameter space” (i.e., every possible combination of values those parameters could take), and every single point in that parameter space is some function.
Most of those functions are useless. Some of them are not. We have no idea which are which, nor can we “convert” from a parameter space function to a classically defined function. Nonetheless, we can stumble upon points in parameter space that are good (or perfect) approximations of classical functions. In the above case, we can convert temperatures and we can calculate averages, perfectly, from two different points in that parameter space.
Let’s return to my example above:
DoesPictureContainCat( picture ) -> yes_or_no
As I mentioned, we don’t know how to write this kind of function; we don’t know how to “understand” what an image is just from looking at its pixels. It’s possible in theory… we would probably have to define millions of variations of what ears might look like, assign probability weights to them, and then use algorithms to seek a nose or eyes, maybe some geometry with error bars to control for orientations of the face. Writing such a function would take tens of thousands of lines of code, be grueling to figure out, and require all kinds of probability insights that we just can’t derive ourselves. In practical terms, it’s not possible.
BUT it IS possible, in theory, to write such a function.
And this is where things get crazy. Let’s take a much larger model, a CNN (this term isn’t important for the point) with 1M parameters. The parameter space for this is ENORMOUS. There are more “functions” in that space than there are atoms in our universe, and it’s not even close. Almost all of those functions are completely useless, but some of them won’t be. In a space that large, in fact, some of them will be absolutely incredible at the job of detecting cats in pictures. Who knows how they actually work, but by nothing other than the huge magnitude of the numbers and the knowledge that such a function is possible to write, we can deduce that many billions of versions of the cat detector are approximated in this parameter space. Wouldn’t it be cool if we could find them, and even work our way towards the best one?
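Just how lopsided is that comparison? A rough back-of-the-envelope, assuming each of the 1M parameters is stored as a 32-bit float:
distinct parameter settings ≈ (2^32)^1,000,000 = 2^32,000,000 ≈ 10^9,600,000
atoms in the observable universe ≈ 10^80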
THIS is what training actually is. When you are training a model to detect a cat, or to generate a picture, or to predict the next word… you aren’t teaching it anything in the way humans learn. You are SEARCHING.
This was the critical insight of AI research: given any function (a point in parameter space), it turns out there is math that lets you understand what “direction” you need to move to get closer to a better version of that function. So for example, we choose a random starting point for our cat detector, and we give it a picture of a frog, and it spits out the number 0.98, which we interpret as “yes, this is a cat.” Well, that’s obviously wrong, so this function we chose at random isn’t any good at doing the job we want. That’s not surprising, but now we can do some math, calculate what’s called the gradient, and determine what general “direction” to move from here to get a better result. We tweak the parameters of the model (which is the same as choosing a brand new function from the parameter space), and we try again.
So you see, at no point are we “teaching” it anything. We’re moving between increasingly-less useless functions — starting at random — and each time simply checking, “how good is this one at cat detection?”
“what about this one?”
“what about this one?”
Do that millions or billions of times, testing each one with example pictures of cats to determine how good it is, and with gradient descent we eventually find our way to a region of this vast multidimensional parameter space that holds functions that actually CAN identify cats.
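Here’s what that loop looks like stripped to the bone, as a minimal PyTorch sketch (the tiny model, the 64x64 picture size, and the random stand-in data are all assumptions made just so the example runs; it is not anyone’s production training code):
import torch
import torch.nn as nn

# Stand-in "cat detector": any model that maps a picture to a single score works here.
# (Toy assumption: 64x64 grayscale pictures; a real detector would be a CNN.)
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1))
loss_fn = nn.BCEWithLogitsLoss()                      # scores how wrong each guess is
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Fake data so the sketch runs: 32 random "pictures" plus cat (1.0) / not-cat (0.0) labels.
pictures = torch.rand(32, 1, 64, 64)
labels = torch.randint(0, 2, (32, 1)).float()

for step in range(1000):               # real training: millions or billions of examples
    guesses = model(pictures)          # ask the current candidate function
    loss = loss_fn(guesses, labels)    # how bad is this one at cat detection?
    optimizer.zero_grad()
    loss.backward()                    # the gradient: which "direction" in parameter space to move
    optimizer.step()                   # tweak the parameters = pick the next candidate function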
No learning, no copying, no studying… just searching. Now, in the field this is still called “learning” because each subsequent candidate function has improved compared to the old one, and the direction we moved was informed by the loss against the data set. So it is as if it “learned” something from the data it saw in that training round, but this is mostly a semantic thing. At the most fundamental level, the data helped steer and inform the search towards a version of the function that handles that data better to begin with.
Large language models use the same idea, but way bigger. Billions of parameters, and a latent parameter space that’s unfathomable… GPT-4 exists in a parameter space of roughly 1 followed by 4 trillion zeros (256^1,800,000,000,000). When they train these models, they show it text: say, the latest chapter from a story you posted online. They present a portion of this text to the model (model = a set of chosen parameters, or in other words, the current “function” they’re evaluating in parameter space) and ask it to predict what comes next. Based on how closely its prediction matches the rest of the chapter, they calculate how far and in what direction they are going to jump to find the next candidate function.
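A single step of that next-word game looks something like this (a toy sketch: the miniature “model” here only looks at the previous word, and the vocabulary, sizes, and random stand-in tokens are all made up; a real GPT-style setup is vastly larger and looks at the whole context):
import torch
import torch.nn as nn

vocab_size = 1000                                   # toy vocabulary
# Stand-in "language model": a token goes in, a score for every possible next token comes out.
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

tokens = torch.randint(0, vocab_size, (1, 51))      # stand-in for the chapter, as token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict each word from the one before it

logits = model(inputs)                              # the current candidate function's guesses
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))  # how far off was it?
optimizer.zero_grad()
loss.backward()                                     # which direction in parameter space to jump
optimizer.step()                                    # move to the next candidate function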
Nothing absorbed your story, nothing studied it or recorded probabilities from it. There’s no mechanism for that. It was simply used to decide how close or far they are from a better version of the function they want (in this case, a version that would have done a better job predicting your specific text).
But think about what this means: right now, in latent parameter space, there exists a function that will produce an exact copy of a chapter you’re going to write TOMORROW. It’s the old thought experiment: put a million monkeys in a room with typewriters and, after a thousand years, one of them will reproduce Shakespeare. The same thing is true here, but a billion times over. The parameter space is SO large that there already are functions whose output will match, exactly, a chapter you’re sitting down to write right now. Whatever words you end up choosing, there’s already a function for that, somewhere out there. There are millions, actually, some with close variations, some with significant variations. And there are also millions of LLM-style functions that have those words “encoded” in them and, with the right inputs, might spit them out, too.
Likewise, there are functions that have the equivalent of all the text people ever wrote already in them — not because they were taught it or given it, but because parameter space is so large that essentially every combination of words exists somewhere, including the ones actually used in the real world.
And with modern machine learning and training techniques, we can find those configurations in a highly optimized way.
This is how all of them work. The image generators, the audio generators, the LLMs, this is what training means. We are running a very optimized search for functions in parameter space that do whatever it is we want them to do. We don’t care how they do it, nor do we know. We find them by using existing data as a kind of Doppler radar to see how close or far we are, at any given iteration, from a better version of the same function.
This is the only sense in which the work is “derived.” By using your posted story to help find a better function, they are biasing the search towards the vast number of existing functions that already contain something like your story inside them. Maybe this is a distinction without a difference, I’m not sure, but we can’t talk clearly about this when everything is abstracted by analogies.
Neural networks are function approximators.
Machine learning is a search algorithm to find the best approximators for whatever task we want solved.
The larger the model, the more parameters, the more functions can be approximated in that parameter space (and thus the smarter and more thorough they might be).