The Coming Age of Generalized AI
In this article I break down the most cutting edge, practical research for multi-function neural networks, aka AIs that can do a wide range of tasks, unlike today’s AIs which are incredibly narrow and good at one thing only.
We live in an age of ambient AI.
AI is everywhere and all around you.
You can’t see it but it’s there. It screens calls for you, weeds out spam and harassing messages on social media, translates your voice to text, spots your friends and relatives in images, figures out what you want to buy before you even know you want it, helps you navigate to any place on Earth, and a thousand other little daily tasks.
And all of it happened in an incredibly short period of time. After decades of broken promises and failed expectations, we suddenly saw an explosion of production AI apps over the last 5 years. In so many ways, ambient AI is a strange beast because it’s not what we expected from the last 50 years of science fiction.
Sci-fi AI looked like one of three things:
AI gone crazy like HAL in 2001
Killer robots like in Terminator
Your friendly robot buddy like R2-D2
None of those really match how AI ended up developing in the real world. That’s because AI didn’t really exist when Star Wars, Terminator and 2001 got written. Writers just made up what they thought it would be like without any real world reference. AI in stories is basically just a person running around in a robot body, like the bad guy in an action movie with a metal skin. People like to read about people, so writers tend to anthropomorphize machines.
But probably the biggest difference with real world AI is that sci-fi AI is general purpose. It can do almost anything, just like a human being. Everything from fixing a broken spaceship so the good guys can escape in the nick of time, to talking fluently with people about any topic, to advanced strategic thinking as they outfox the bad guys in a daring game of cat and mouse.
But the real world is a bit different. We’ve got models that can smash the world’s best Go players, recreate a younger Luke Skywalker for TV and film, spot skin cancer from a smart phone, self-drive a car through the complex and chaotic streets of a big city, and trillion parameter transformers that can hold a conversation or translate documents.
The uniting force behind all of these amazing real world examples is that they’re all single purpose AI.
They’re what we call “narrow AI,” good at one thing and one thing only.
That means AlphaGo can play Go but it can’t drive a car or pick out cats in YouTube videos. DeepMind and other research labs have made strides towards generalized AI, like MuZero, a successor of AlphaGo that can learn just about any perfect information game, like Chess, Go or Atari games, but it still can’t and won’t hold a conversation or drive a car. It can be retooled to do things like improving video compression but the same instance that learned Chess can’t optimize video.
In other words, MuZero can’t learn multiple things at the same time.
Teach one instance of MuZero to play Go and it will crush everyone at Go. But then take the same instance and try to teach it chess and it will forget everything it ever learned about Go and get trounced by a ten year old amateur player. That’s called catastrophic forgetting and it’s one of the great weaknesses of our ambient AI wunderkinds. More on that later.
The real question is, can we ever build machines that will master multiple tasks, maintain long term memory, transfer their skills to new tasks and continually learn across their whole lifecycle?
For most of AI history the answer was a decisive no. Even a few years ago the answer was “theoretically but we don’t really know how to get there because something is missing.”
Researchers tried multiple ways to teach machines reasoning and general purpose skills and failed. Human intelligence is a black box. We don’t know how we do what we do. We easily learn to recite poetry and run a marathon and play video games but we’re not sure why we can do any of it. Our current approaches looked like a dead end.
But all that is changing fast.
Over the past few years, the top AI research institutes have turned their attention to the harder problems of AI:
Making a single AI that’s good at lots of things.
Even better, it’s not just hope and wishful thinking. They’ve hit on real world, practical approaches that put us on the cusp of a generalized AI breakthrough.
Of course, we’ve seen a lot of terrible predictions over the years that general purpose AI was just around the corner. All have proved absurdly optimistic, if not downright foolish. We’ve been on the verge of generalized AI for 30 years.
Don’t look now, but this time it just might be true.
And R2-D2 may be closer than we think.
The Most Promising Techniques for the Intelligence of Tomorrow
One thing before we go any further:
You may have noticed I didn’t use the term “AGI”, which stands for Artificial General Intelligence.
AGI is a loaded acronym.
It conjures up conscious robots able to act like human beings in every way, or superintelligent boogeymen that will take over everything and make us all obsolete. That’s not what I mean here. It’s a bit like using the word God. An atheist, a Christian, a Buddhist and a Muslim will all come to the table with built-in assumptions. The same happens with AGI which conjures up existential questions about consciousness and superintelligent machines taking over and a million other images that have little to do with reality but make for great stories.
What I’m talking about is something much more realistic than AGI. It’s attainable in the near future, with cutting edge techniques in development right now. That’s why I’m calling it “Generalized Intelligence” (GI), to indicate it’s a big step down from AGI.
In short, Generalized AI (GI) is not AGI.
So what is it?
It’s an AI that can perform incredibly well in a single problem domain or a few domains.
Think of it as cat intelligence.
A cat is a remarkable creature. It can run fast, sleep in tiny boxes, find food and water, eat, purr, defend itself, climb trees, land on its feet from great heights and handle hundreds of other subtasks. A cat won’t learn language or suddenly start composing poetry. That’s perfectly fine because a cat is really well suited to its set of tasks; it doesn’t need to build skyscrapers too.
Having a cat level intelligence is incredibly compelling. If you have a cleaning robot that can wash dishes, pick up clothes, fold them, carry them from place to place and iron shirts, that’s an incredible machine that people would clamor to buy. It doesn’t also need to write music, craft building blueprints, talk to you about your relationship problems, and fly a plane too.
As DeepMind researcher Raia Hadsell told IEEE Spectrum: “I find that level of animal intelligence, which involves incredible agility in the world, fusing different sensory modalities, really appealing. You know the cat is never going to learn language, and I’m okay with that.”
In other words, we’re talking about virtual or robotic systems that can perform a wide variety of tasks in one broad domain, like flying, driving, or cleaning, without ever encountering a task that makes them completely fail, freeze up or break down. That’s generalized intelligence. And it’s going to revolutionize our world long before we ever have to worry about philosophical questions like “do androids dream of electric sheep?”
But how do we get there?
Researchers have hit on a number of different methods in the last few years that just might deliver us remarkably adept and versatile AI. This is truly cutting edge stuff, with most of the major papers on these ideas coming in 2019, 2020, and 2021. A decade ago researchers didn’t even know where to start and now they’re hard at work on realistic approaches that stand a great chance of making GI a reality.
The major new approaches are:
Retrieval and Rules based AI
Bi-Level Continual Learning and Progress and Compress
Bi-Level Continual Learning and Progress and Compress are two nearly identical techniques, developed in parallel by different research teams. Both are essentially a combination of three older techniques: progressive neural networks, elastic weight consolidation and knowledge distillation.
Many of these new ideas are coming to us from the world of robotics but they’re likely to have an impact on ambient AI even faster. Ambient AI is just software online. Robots have other problems that an online only AI simply doesn’t face. The real world is messy. There’s friction and gravity and tiny imperfections in surfaces and wind and dust. Just watch these amazing robots at Boston Dynamics failing at parkour, a challenging obstacle course, and you’ll see how easily the real world trips up machines (and people).
Let’s look at the hard problems these approaches are looking to solve, and then run through the various approaches to see why they’re potentially revolutionary solutions.
Memory, Progress and the Novelty of the Never Before Seen
Retrieval and Rules based AIs are trying to solve one of the deepest problems in deep learning and that’s that deep learning models are really dumb. GPT-3 is a massive language model that can mimic speech incredibly well. It sounds human and it can write sentences that are incredibly lifelike. But there’s no real comprehension to what it’s saying. It has no grasp of reality.
It might write sentences that are really compelling about physics or cars or relationship problems, but it doesn’t know a damn thing about physics or cars or relationship problems as a concept the way a person does. All it’s really doing is predicting the next best word in a sentence and that is surprisingly powerful but not very smart. It might answer a question in a way that sounds perfectly reasonable but is just totally wrong. It can’t link together concepts and it has no boundaries. In other words, it just mimics language but it doesn’t really know what it’s saying.
DeepMind looked to address that recently with its RETRO model, which stands for Retrieval Enhanced Transformer, what I’m calling a retrieval system. Basically it combines the system with a database of information it can look up or retrieve to craft more compelling answers. As its corpus of knowledge expands, so does its ability to know what it’s talking about and answer correctly.
The RETRO model has 25 times fewer parameters than GPT-3 and yet performs as well or better at writing compelling answers. But it’s not limited to what it learned in training. As DeepMind writes on their blog, the model “has access to the entire training dataset through the retrieval mechanism. This results in significant performance gains compared to a standard Transformer with the same number of parameters. We show that language modeling improves continuously as we increase the size of the retrieval database, at least up to 2 trillion tokens – 175 full lifetimes of continuous reading.”
It’s not hard to imagine an expansion of this technique to other types of neural networks. If a model can look up information, the same way we use a search engine, it can perform much better on a number of tasks, like writing code or answering questions or understanding the context of a conversation. While that doesn’t give us machines that truly understand what they’re talking about, it does give them a knowledge base to work from and that gets us closer to giving it a baseline grasp of reality.
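To make the lookup idea concrete, here is a minimal sketch of retrieval in plain Python. It is not how RETRO works internally: the bag-of-words "embedding", the three-line database and the `build_prompt` helper are all stand-ins I made up for illustration, where a real system would use a learned encoder and a database with trillions of tokens.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, database, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(database, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical knowledge base, invented for this example.
DATABASE = [
    "water boils at 100 degrees celsius at sea level",
    "go is played on a 19 by 19 board",
    "cats usually land on their feet",
]

def build_prompt(question, database):
    """Prepend the retrieved passage so the model can condition on it."""
    context = retrieve(question, database, k=1)
    return "Context: " + " ".join(context) + "\nQuestion: " + question

print(build_prompt("how big is a go board", DATABASE))
```

The point of the design is that the knowledge lives in the database, not in the weights. Grow the database and the model has more to draw on without any retraining.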
OpenAI, the creators of GPT-3, used a different technique to make GPT-3 smarter, called InstructGPT. Basically they use human-in-the-loop reinforcement learning to tell GPT-3 which of its answers are more accurate and compelling. That also produced a radically smaller model, with only 1.3 billion parameters versus GPT-3’s 175 billion parameters. The smaller model makes up facts a lot less often and produces less toxic output.
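One piece of that human feedback loop is easy to sketch. As I understand the approach, people rank pairs of answers and a separate reward model learns to score the preferred answer higher, using a pairwise loss like the one below. The function name and the reward numbers are my own illustrative assumptions, not OpenAI's code.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss, -log(sigmoid(margin)): small when the answer
    humans preferred gets the higher reward score."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward scores for two answers to the same prompt.
print(preference_loss(2.0, 0.0))  # reward model agrees with the human ranking
print(preference_loss(0.0, 2.0))  # reward model disagrees, so the loss is larger
```

The reward model trained this way then supplies the "reward" half of the reinforcement learning step that tunes the language model itself.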
Again, it’s not hard to see a combination of these techniques producing a much better model that’s smaller and more efficient. Give the model the ability to look up information and teach it through reinforcement learning, which is essentially reward and punishment training for machines, and you start to have a model that “stays on the rails” and does more of what you want, more often, especially in complex environments. If you limit its range to smaller domains, it becomes easier to control, safer and “smarter.”
Lastly, perhaps the most promising approach to combining rules and deep learning comes to us from Google researchers with their DeepCTRL training system. One of the system’s biggest benefits is that it doesn’t need any retraining to adjust rule strength. Rules can be adjusted at inference time to keep the model within its guardrails if it starts making poor decisions. Even better, it’s totally agnostic to data type, which means it should work just as well for unstructured data like videos and medical images as it does for text. Google is testing it in areas where rules are utterly essential, like physics and healthcare, where an out of bounds interpretation can result in explosions, disease and death.
Ultimately, Google’s approach may lead us to the holy grail of combining symbolic logic and heuristics with deep learning. It’s one of the first methods that combines rules and deep learning in a way that lets people adjust those rules later and it could prove the breakthrough we need to keep deep learning models from inventing nonsense that sounds good but is totally wrong the way GPT-3 does now.
That gets us to our second set of techniques, Bi-Level Continual Learning and Progress and Compress.
Generalized Neural Nets and the Weights of Memory
If Retrieval and Rule based systems are trying to make AI smarter and more consistent, then Bi-Level Continual Learning and Progress and Compress are looking to solve the hardest problem in AI, and that’s making AI good at lots of things. To do that they have to solve a problem we hinted at earlier, catastrophic forgetting.
Remember what we said about MuZero earlier. Train MuZero on Go and it will get superhuman at playing Go. Now take the same instance of MuZero and train it on chess. What happens? It gets incredible at chess and forgets everything it ever knew about Go.
It’s not a problem of storage or memory. It’s intrinsic to the way neural nets work today. Basically, neural nets are a bunch of neuron-like nodes arranged in layers that link together in a way that’s similar to connections between synapses in the brain. Each of those connections has “weights” which are roughly equivalent to how important that connection is for the network to do its job.
Before a neural net can perform a task like recognizing cats in pictures, it needs to get trained. The weights in the network get randomized when it starts training and that makes it not very smart. In the beginning the network is pretty terrible at recognizing cats. But as you feed it a bunch of data that you’ve labeled with “cats” and “not cats,” it learns over many training cycles, called epochs, how to recognize cats by adjusting its weights. Eventually, after enough training, the network is great at recognizing cats, its weights are more or less fixed, and we get a “trained model” that’s ready to go to work.
Now you can point the model at images and have it render a verdict on whether the picture has a cat in it. It’s a lean, mean, cat recognizing machine.
But the problems start if you now want to take the trained model and have it learn to recognize dogs. As you train it again, all those weights will get adjusted to make it more successful at recognizing dogs. But the weights that show the network how to recognize cats are now gone, overwritten with the new weights.
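You can watch the overwriting happen in a toy model. The sketch below is a deliberately tiny stand-in for a neural net: one weight, trained by gradient descent, first on a made-up "task A" and then on a conflicting "task B". The tasks and numbers are invented for illustration.

```python
def train(w, data, lr=0.1, epochs=200):
    """Plain gradient descent on squared error for the model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w

def loss(w, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Two toy tasks that want different values for the same weight.
task_a = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)]
task_b = [(x, -1.0 * x) for x in (0.5, 1.0, 1.5)]

w = train(0.0, task_a)
loss_a_before = loss(w, task_a)   # near zero: task A is learned

w = train(w, task_b)              # keep training the same weight on task B
loss_a_after = loss(w, task_a)    # large: task A has been overwritten

print(loss_a_before, loss_a_after)
```

Nothing was deleted on purpose. The second round of training simply moved the weight to wherever task B wanted it, and task A's solution went with it.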
One technique researchers like Raia Hadsell use to fix catastrophic forgetting is “elastic weight consolidation.” It’s basically a way of partially freezing connections in the model. The technique figures out which of those connections are most important for the task the network already knows. It turns out to be a surprisingly small number most of the time, maybe 5% or 10%. It then keeps them mostly frozen in place so they can’t change much. Now you can retrain the network and it can learn new skills without overwriting the old ones.
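A minimal sketch of that partial freezing, assuming a two-weight linear model and invented tasks. The penalty term pulls each weight back toward its old value, scaled by how important that weight was for the old task. The importance estimate here is the diagonal Fisher information for this simple model, and `lam` is a hypothetical strength setting.

```python
def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def mse_grad(w, data):
    g = [0.0, 0.0]
    for x, y in data:
        err = predict(w, x) - y
        for i in range(2):
            g[i] += 2.0 * err * x[i] / len(data)
    return g

def importance(data):
    """Per-weight importance: mean of x_i squared, which is the diagonal
    Fisher information for a linear model with squared error."""
    return [sum(x[i] ** 2 for x, _ in data) / len(data) for i in range(2)]

def train(w, data, anchor=None, fisher=None, lam=0.0, lr=0.05, steps=2000):
    """Gradient descent, optionally with an elastic penalty toward `anchor`."""
    w = list(w)
    for _ in range(steps):
        g = mse_grad(w, data)
        for i in range(2):
            if anchor is not None:
                g[i] += lam * fisher[i] * (w[i] - anchor[i])
            w[i] -= lr * g[i]
    return w

# Task A only uses the first input; task B uses both.
task_a = [((x, 0.0), 2.0 * x) for x in (0.5, 1.0, 1.5)]
task_b = [((x, x), 3.0 * x) for x in (0.5, 1.0, 1.5)]

w_a = train([0.0, 0.0], task_a)
plain = train(w_a, task_b)
ewc = train(w_a, task_b, anchor=w_a, fisher=importance(task_a), lam=10.0)

print("task A loss, plain retraining:", mse(plain, task_a))
print("task A loss, with penalty:   ", mse(ewc, task_a))
```

With plain retraining both weights drift and task A gets worse. With the penalty, the weight that mattered for task A stays put and the unused weight absorbs task B.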
If you’re paying close attention, you probably noticed a big potential problem with this technique. Eventually most of the neurons will get frozen and the network won’t be able to learn anything new. It’s analogous to the way we learn. As babies our brains are very plastic and we can learn seemingly anything very fast. But over time we have experiences and our mind gets “set” and our connections get reinforced and we have trouble loosening those connections and adjusting how we think or what we can learn. It’s why kids can learn languages with lightning speed but it gets much harder as an adult if you don’t have an aptitude for it.
But Bi-Level Continual learning has the potential to fix those issues. Hadsell calls her very similar approach “progress and compress.”
Both techniques create two neural networks, a fast learning network and a base model. That roughly mirrors the functioning of our brain yet again. Think of it as the hippocampus and neocortex. As Hannah Peterson writes in her article on catastrophic forgetting, “In our brains, the hippocampus is responsible for ‘rapid learning and acquiring new experiences’ and the neocortex is tasked with ‘capturing common knowledge of all observed tasks.’” That dual network approach is called a progressive neural network.
The fast neural network is smaller and more agile. It learns new tasks then transfers the finalized weights to the base model. So you end up with a lot of stored neural networks good at a bunch of tasks.
But there’s a problem with basic progressive neural nets. They don’t share information bi-directionally. You train the fast network on one task and freeze those weights and transfer them to the bigger network for storage but if you train the network first on recognizing dogs, it can’t help the new network training on cats. The cat training starts from scratch.
Bi-Level Continual Learning and Progress and Compress both fix that problem by using a technique called knowledge distillation, developed by deep learning godfather Geoffrey Hinton and his colleagues. Basically, it involves training one network, the student, to reproduce the outputs of one or more other networks, the teachers, so that what they know gets compressed into a single neural network. Now you can combine your dog trained model and cat trained model and each model shares knowledge bi-directionally. The new network is sometimes slightly worse or slightly better at recognizing either animal but it can do both.
The one downside of knowledge distillation is that it brings the problem of catastrophic forgetting roaring back. If you constantly distill new tasks into the storage neural net, you overwrite its old connections. So if you distill the dog and cat networks together and then add a new penguin recognizer, you might drastically reduce the cat and dog capabilities.
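For the curious, the core of distillation in Hinton's formulation is a loss that asks a "student" network to match the softened output probabilities of a "teacher" network. Below is a small sketch of that loss in plain Python. The logits are invented numbers, and a real setup would compute this over whole batches inside a training loop.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities. Higher temperature = softer."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)                      # subtract max for numerical safety
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

# Hypothetical scores for the classes (cat, dog, penguin).
teacher = [4.0, 1.0, -2.0]
good_student = [3.5, 1.2, -1.5]
bad_student = [-2.0, 1.0, 4.0]

print(distillation_loss(good_student, teacher))
print(distillation_loss(bad_student, teacher))
```

The soft targets carry more information than a bare label. The teacher isn't just saying "cat", it's saying "mostly cat, a little dog, almost certainly not penguin", and the student learns from all of it.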
Progress and compress addresses that by bringing elastic weight consolidation back into the picture. As Tom Chivers writes in IEEE Spectrum, “Each time the active column transfers its learning about a particular task to the knowledge base, it partially freezes the nodes most important to that particular task.”
Even better, the progressive neural net and knowledge distillation approaches avoid the main problem with elastic weight consolidation, which is that all the weights eventually freeze. You can make the storage network massive, so a few frozen connections don’t matter that much, and the fast network can be much smaller and more agile.
Other researchers are using different techniques to solve the major challenges of machine learning.
An alternative to the Bi-Level Continual Learning/Progress and Compress approaches is called Replay. The technique takes inspiration from how we dream. At night, we essentially replay the most important images and experiences we’ve seen during the day to strengthen those connections faster. “iCaRL: Incremental Classifier and Representation Learning” and REMIND networks use variations on that technique to overcome catastrophic forgetting. Chivers notes that Ted Senator, a program director at DARPA, the Defense Advanced Research Projects Agency, is using replay techniques in the SAIL-ON project. SAIL-ON is short for “Science of Artificial Intelligence and Learning for Open-World Novelty,” a project designed to teach machines how to adapt to changing rules and changing environments.
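The basic mechanics of replay fit in a few lines. This sketch keeps a small memory of old examples and mixes a few of them into every new training batch. Using reservoir sampling to decide what to keep is my own choice for the example, not what iCaRL or REMIND do, and the class and function names are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past training examples."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Reservoir sampling: every example seen so far has an equal
        chance of still being in the buffer."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, n):
        return self.rng.sample(self.items, min(n, len(self.items)))

def mixed_batch(new_examples, buffer, n_replay):
    """New-task examples plus a few replayed old ones."""
    return list(new_examples) + buffer.sample(n_replay)

# Store memories while learning cats, then replay them while learning dogs.
buffer = ReplayBuffer(capacity=5)
for i in range(100):
    buffer.add(("cat", i))

batch = mixed_batch([("dog", i) for i in range(4)], buffer, n_replay=2)
print(batch)
```

Because every dog batch also contains a couple of cats, the weights that matter for cats keep getting refreshed instead of overwritten.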
The Economist called DARPA the agency “that shaped the modern world” and noted that “Moderna’s covid-19 vaccine sits alongside weather satellites, GPS, drones, stealth technology, voice interfaces, the personal computer and the internet on the list of innovations for which DARPA can claim at least partial credit.” They focus on high-risk, big reward, moonshot style projects that deliver breakthroughs.
When it comes to AI, the research agency’s pedigree is long and strong. They pushed research teams to the limits with their original DARPA Grand Challenge, calling on teams to compete to get autonomous vehicles through a roughly 150 mile race course in the Mojave desert. In the first year, 2004, nobody finished and the best entry barely made it 7 miles before getting stuck. Most of the teams couldn’t make it past the first mile and a few didn’t even make it out of the starting gate. By 2005, five teams had finished the race. Now they have grand challenges for robotics and they routinely push the cutting edge with everything from digital technology to biotech.
Replay is just one of the technologies DARPA wants to use to make intelligent agents more intelligent. The SAIL-ON project is about a lot more than memory and learning different tasks. It wants to solve the biggest problem in AI:
How does a machine deal with totally new situations or situations that have changed dramatically from the training data?
If you were paying close attention, you might have realized that all the techniques we talked about still have one major flaw. They require lots of training. Every task you want an ambient AI or a robot to learn means you’ve got to teach it to do that task. You can do it through reinforcement learning, or semi-supervised learning or through genetic algorithms, or through fully supervised learning. But with every one of those techniques you’re still doing the training.
The one thing those systems can’t do is adapt intelligently to a novel situation.
If you want your house cleaning bot to do the dishes and vacuum and fold your laundry, you have to teach it about each action separately. But if you want it to cook it won’t just be able to pick up cooking on its own. It will need more training. Perhaps that’s not a huge problem, because the system could always pull down new tasks from a cloud catalog and integrate them into its long term neural net, but it’s still inefficient.
In the end, what we really want are systems that can adapt to brand new situations and not freeze up while they wait for a download. They should be able to use what they already know to figure it out.
There’s a chance that these generalized systems may exhibit some behavior of adaptation just by having a huge number of similar or diverse tasks in their memory banks, but there is no guarantee that more complex behavior will simply emerge from the primordial ooze of complexity. In the end, those systems may learn dozens or hundreds or even thousands of tasks but there is no guarantee they will be able to do anything they haven’t encountered before. In all likelihood we’ll need new algorithms that can deal with uncertainty.
That’s what DARPA’s SAIL-ON is really about. It wants to teach systems to adapt to brand new situations the machine has never seen and take action, creating a new policy net on the fly. DARPA’s call for researchers says:
“The focus is on novelty that arises from violations of implicit or explicit assumptions in an agent’s model of the external world, including other agents, the environment, and their interactions. Specifically, the program will: (1) develop scientific principles to quantify and characterize novelty in open world domains; (2) create AI systems that act appropriately and effectively in open world domains; and (3) demonstrate and evaluate these systems in multiple domains, including a selected DoD domain.”
As Senator puts it: “Imagine if the rules for chess were changed mid-game. How would an AI system know if the board had become larger, or if the object of the game was no longer to checkmate your opponent’s king but to capture all his pawns? Or what if rooks could now move like bishops? Would the AI be able to figure out what had changed and be able to adapt to it?”
The bigger goal of the SAIL-ON project is to remove the training step for a new situation. The model would recognize the rules had changed and then successfully respond to those novel conditions. That doesn’t eliminate training altogether but it stands the best chance of keeping the machine from going TILT when it sees something it’s never seen before.
That’s a lot like the cat intelligence we talked about earlier. Even if you raised your cat from a baby in one house and never showed it anything of the outside world, then you suddenly took it outside and put it in a field by a river it wouldn’t freeze up. It might get scared and look for a place to hide at first but it will figure out how to deal with the grass and the water even though it’s never seen either in its entire life.
That ability to adapt to novelty is something humans do incredibly well too. If I take you outside and teach you how to play catch for the first time, you might not be that good at it but you’ll understand the basic rules and you might even catch a few balls with no previous experience of catching balls. That’s still out of the realm of possibility for even the most advanced AIs today and that’s what DARPA wants to fix. Just like humans, maybe those adaptable intelligences won’t be pro baseball players their first time playing catch, but they won’t freeze up. Adding in training afterwards could get them to another level, just like a lot of practice might turn that kid into a world class batter or pitcher.
They may even teach machines to abstract.
That’s the one real talent that humans still have over the machines.
A machine can learn to recognize knives but won’t suddenly recognize a jagged rock or a claw as dangerous without learning about rocks and claws too. But a human can get cut one time, feel that pain and then abstract the concept of “sharpness” from the experience. When they spot a jagged rock or a railroad spike or anything else that looks sharp they’ll avoid it because they understand it has the same properties as the knife and it’s going to hurt them.
It’s likely that in the end, tomorrow’s machines will use a combination of all of these techniques and more.
AlphaGo was a combination of three existing ML techniques: reinforcement learning to reward it or punish it for winning and losing, Monte Carlo tree search to deal with novel gameplay, and neural nets that learned from supervised learning. The reinforcement learning framework helped the bot develop a policy network or a set of possibilities in every situation. When it didn’t have a perfect move from that policy network, because there was no way to simulate every possible Go situation, it used Monte Carlo tree search to simulate new possibilities and make a quick decision.
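The Monte Carlo part is the easiest piece to demonstrate. The sketch below is flat Monte Carlo evaluation on a toy stone-taking game, not the full tree search AlphaGo used and nowhere near Go. Each candidate move gets scored by playing lots of random games to the end and counting wins. The game, the function names and the rollout count are all assumptions for the example.

```python
import random

def legal_moves(stones):
    """Take 1, 2 or 3 stones. Whoever takes the last stone wins."""
    return [m for m in (1, 2, 3) if m <= stones]

def random_playout(stones, rng):
    """Play random moves to the end of the game.
    Returns True if the player who moves first from here wins."""
    first_player_to_move = True
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return first_player_to_move
        first_player_to_move = not first_player_to_move

def estimate_move(stones, move, rollouts, rng):
    """Estimated win rate for us if we make this move now."""
    remaining = stones - move
    if remaining == 0:
        return 1.0  # we took the last stone
    # The opponent moves next, so we win whenever they lose the playout.
    wins = sum(not random_playout(remaining, rng) for _ in range(rollouts))
    return wins / rollouts

def best_move(stones, rollouts=2000, seed=0):
    rng = random.Random(seed)
    scores = {m: estimate_move(stones, m, rollouts, rng)
              for m in legal_moves(stones)}
    return max(scores, key=scores.get), scores

move, scores = best_move(5)
print(move, scores)
```

The machine never gets told the strategy. It just simulates futures and picks the move that wins most often, which is the same basic trick that lets a Go engine act sensibly in positions it has never seen.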
Expect tomorrow’s machines to combine everything from reinforcement learning, to Bi-Level Continual Learning, to elastic weight consolidation, database lookups, Monte Carlo tree search and whatever brilliant new novelty response algorithms come out of DARPA or another team’s research.
The AIs of the next decade will have an incredible amount of range and capabilities that make today’s machines look like early mud huts next to a skyscraper.
Coming Soon to a Robot Near You
My biggest takeaway from all this reading and research is that advanced AI work is happening and it’s progressing a lot faster and holds a lot more promise than most folks realize.
Most of this is not pie in the sky. It’s already getting tested and put to work in labs around the world.
That’s likely for two reasons.
The first is that we’re not starting from scratch.
Now that we have real world success with AI, there’s something to build on. It was hard for someone to invent the first light bulb, but once it exists it’s much easier for other researchers to build on that original idea, to make the light bulb longer lasting or cheaper or brighter. Totally original ideas are rare, but once an idea exists many brilliant minds leap in to expand it and refine it.
The second is resources.
AI is big business and the payoff for a real breakthrough is huge. Every business on Earth will transform dramatically with smarter AI that can do many tasks really well. The range of applications is massive, everything from trading, to fraud detection, to home cleaning machines, to adaptive defense computer networks, to ambient AI that can spot disease and recommend treatments better and faster. And that means researchers are now backed by tens of billions of dollars and they have the time and computational power to pursue new approaches that can lead to real breakthroughs instead of just dreaming about them or thinking about them.
While some of the techniques we talked about here might not pan out or might get replaced by newer, better ideas, it still looks like we’re on the verge of a real leap forward in intelligent systems. A decade or two from now, catastrophic forgetting will be forgotten and robots that can care for an elderly person in an old age home will start to look like a real possibility. We’ll have code-writing ambient AIs that don’t just craft mid-level programmer code but invent novel code and solve new problems, working in tandem with top tier research programmers.
And while we may never have a robot as awesome as R2-D2 in our lifetime, we may just have one that can clean the house and do the dishes and maybe even make cute beeping noises along the way.
I’m an author, engineer, pro-blogger, podcaster, public speaker. My upcoming book, Mastering Depression and Living the Life You Were Meant to Live tells the story of how you can battle the dark forces of existence and still find a way to live a big, bold and beautiful life anyway.