There’s a lot of discussion about where software is heading and what the future will look like with all the buzz generated by large AI models like GPT. With simple text prompting to an AI model, it’s now possible to generate human-indistinguishable text, surrealistic images, functional code, inspirational music, and more. This is all Software 3.0.
Software 3.0 represents a fundamental shift in software engineering and development, where the new programming language is simply “English”.
In this post, I provide insights using an analogy to semiconductor chips & early computing (note that this post is purely about software). This post was originally written before GPT-4's release (knowledge cutoff: 28 Feb 2023).
Software 1.0, the "classical stack," is the programming paradigm that we are all familiar with. It involves writing source code in languages such as Python, C++, and others, which is then compiled into explicit instructions for the computer to execute. In Software 1.0, programming involves writing every line of code to achieve a desired behavior.
This paradigm has been the foundation of software development for many years, and while it is still widely used today, it's important to note that there are newer paradigms emerging that differ significantly from Software 1.0. These are Software 2.0 and Software 3.0.
Software 2.0, the "neural network stack," (see Andrej’s great blog) is written in much more abstract languages that are less accessible to humans, such as the weights of a neural network. This code is not written by humans; the vast number of weights must instead be found through optimization, typically backpropagation and gradient descent over a dataset of desired inputs and outputs. In Software 2.0, the code usually consists of a dataset that defines the desirable behavior and a neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in. The process of training the neural network compiles the dataset into a program with the desired behavior — the final neural network.
In contrast, Software 3.0 is written in simple natural language, and a desirable program is obtained by querying a large AI model that can accomplish a broad range of tasks (also called a Foundation Model) with an input prefix and passing it some input and output examples, all in natural language, without any human coding or optimization involved. In Software 3.0, programming involves “prompt engineering” to design a well-formed natural language prompt to query the model to get the desired output.
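To make "prompt engineering as programming" concrete, a Software 3.0 "program" can be nothing more than a template that assembles a task description, a few input/output examples, and a new query into a single natural language prompt. The task, examples, and helper function below are illustrative assumptions, not any particular API — a minimal sketch:

```python
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Compose a few-shot prompt: instruction, in-context examples, new input."""
    lines = [task, ""]
    for inp, out in examples:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

# The "program" is this prompt; querying a large AI model with it runs it.
prompt = build_prompt(
    task="Classify the sentiment of each review as positive or negative.",
    examples=[("I loved this film!", "positive"),
              ("Terrible plot and worse acting.", "negative")],
    query="An instant classic.",
)
```

The model is expected to continue the text after the final "Output:" — that continuation is the program's return value.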
It is significantly easier to simply query (or prompt) a generalized AI model to get a desired behavior than to collect the data to identify the desirable behavior and train a neural network (2.0) or explicitly write the program (1.0). Because of this ease of use, along with many other benefits of Software 3.0, we are witnessing a new trend across the industry where a lot of 1.0 & 2.0 code is being replaced by 3.0 code.
AI (Software 2.0) is eating software and now, human-interpretable AGI (Software 3.0) is eating AI.*
*Here, AGI refers to a “generalist” AI agent that shows very good zero-shot performance on a multitude of tasks for a particular data modality, and this post avoids getting into what qualifies as true AGI or not.
Transitioning to Software 3.0: A New Era of AI
The ongoing shift from Software 2.0 to 3.0 has brought about significant advancements in various domains. Instead of training specialized neural networks for individual tasks, this new paradigm focuses on utilizing a single, large-scale AI model that excels in a range of tasks for each domain, effectively incorporating it into the 3.0 stack. Let's explore some specific examples that demonstrate the impact of this transition:
Natural Language Processing. In the past, different models were employed for diverse tasks such as translation, understanding, and summarization. However, the field has now converged to using a single, large AI model like GPT-3, capable of handling all these tasks.
Image Synthesis. Earlier image synthesis techniques focused on models like Generative Adversarial Networks (GANs) and specialized neural networks to generate images in a particular style or category. However, the advent of Stable Diffusion models has brought about a significant shift in this domain. These models enable high-quality, diverse, and controllable image synthesis of almost any scene and style conditioned on natural language.
Visual Understanding. The landscape of computer vision has been rapidly evolving in the Software 3.0 era. Previously, Convolutional Neural Network (CNN) architectures, often pre-trained on extensive datasets like ImageNet and fine-tuned on smaller custom datasets, dominated this field. However, recent breakthroughs have introduced more powerful and scalable architectures such as Vision Transformers (ViTs), methods like CLIP, and Visual Large-Models (VLMs) like Flamingo, which enable zero-shot visual question-answering, recognition of new objects & scene understanding all using natural language.
Speech Recognition. The days of training multiple neural network architectures for specific languages and accents are fading, as the field converges on a single, large model like Whisper. This unified approach offers exceptional multi-lingual recognition capabilities.
Speech Synthesis. Before, large ConvNets like WaveNet were utilized to generate raw audio signal outputs of generic human speech with limited tones and accents. Nowadays, models like VALL-E can synthesize personalized human speech from any three-second audio recording.
The overarching theme across these domains is the replacement of specific models tailored for individual tasks with a single, large AI model capable of handling multiple tasks and exhibiting impressive zero-shot generalization. This theme is not limited to the field of AI and is spreading into other fields such as Robotics, Physical Sciences, Medicine, Finance, etc. The transition to Software 3.0 is revolutionizing the way we approach problem-solving and broadening the scope of AI applications.
Building Software 3.0: The Future of AI-Driven Computing
In the world of Software 3.0, large AI models like GPT-3 have become the backbone of the new computing paradigm. These models are limited by a fixed context length, such as 4096 tokens. There is a natural way of thinking about this in Software 3.0 using an analogy with 32-bit instruction CPU chips, where:
Natural language is the new machine code (akin to Assembly language)
GPT is a “Neural Compute Unit” (NCU) with a 4096-natural language token instruction set (akin to a Compute Processor)
In traditional computing, high-level program code is compiled into fixed-sized machine language instructions, which are then sent to a processor to carry out the intended operations. This process of relaying fixed-sized instructions to a compute processor bears a strong resemblance to providing 4096-token natural language instructions to an AI model such as GPT in Software 3.0.
In this paradigm, a single 4096 token instruction contains an input prompt, some in-context example data, along with empty space for the neural compute unit (GPT) to write its output back (this is similar to how “punch cards” were used in early computing as a means of all input, output & storage). With GPT functioning as a neural compute unit in this analogy, the question arises: can we build a full neural computer with it?
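The fixed-size-instruction analogy can be sketched directly: a prompt, as many in-context examples as will fit, and reserved space for the output must all share the 4096-token window. The whitespace tokenizer and budget numbers below are crude stand-ins for a real tokenizer (e.g. BPE), purely for illustration:

```python
CONTEXT_TOKENS = 4096

def n_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def pack_instruction(prompt: str, examples: list[str],
                     reserve_for_output: int = 512) -> str:
    """Greedily pack in-context examples into one fixed-size instruction,
    leaving room for the neural compute unit to write its output back."""
    budget = CONTEXT_TOKENS - reserve_for_output - n_tokens(prompt)
    packed = []
    for ex in examples:
        cost = n_tokens(ex)
        if cost > budget:
            break  # this example no longer fits in the window
        packed.append(ex)
        budget -= cost
    return "\n\n".join([prompt] + packed)

instruction = pack_instruction("Translate to French:",
                               ["hello -> bonjour", "cat -> chat"])
```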
Building a neural computer in Software 3.0
To build a neural computer, we can begin by recreating the basic components of a traditional computer, such as memory, caching, variable registers, and more.
Instruction Sets: We need powerful instruction sets to use with a neural compute unit such as GPT. These encode the format of the input & output to the AI model and enable powerful interaction, often with intermediate steps. Examples of such early instruction sets are chain-of-thought (CoT) and ReAct.
Variable Registers: We need to be able to load input variables and store output variables while interacting with the AI model. This can involve techniques like retrieval.
Caching: The caching of inputs and outputs is an important aspect. Conversation buffers, for instance, can be seen as a means of caching in Software 3.0.
Logic: It’s likely that simple and 100% verifiable Software 1.0 will continue to exist, but will be operated as tools by Software 3.0 code. Examples of such simple tools include calculators, calendars and basic functions.
Memory: Storing data is crucial in a computer. Mechanisms like embeddings can be used to store and retrieve data efficiently from vector databases that serve as memory.
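The memory component can be sketched as a toy vector store: texts are embedded as vectors, and reads return the nearest stored entry. A real system would use learned embeddings and a vector database; the bag-of-words vectors and cosine similarity below are illustrative stand-ins:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a stand-in for a
    learned embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Memory:
    """Minimal vector-database-style memory: write texts, read by similarity."""
    def __init__(self):
        self.store: list[tuple[Counter, str]] = []
    def write(self, text: str) -> None:
        self.store.append((embed(text), text))
    def read(self, query: str) -> str:
        return max(self.store, key=lambda kv: cosine(kv[0], embed(query)))[1]

mem = Memory()
mem.write("The meeting is scheduled for Friday at 3pm")
mem.write("The project deadline is next Monday")
```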
These are just examples of early precursors of components needed to build a full Neural Computer, and we will likely see a lot of progress here. Even when we have assembled the low-level layer, we will need to build the user interface for interacting with such a neural computer, as well as an operating system (OS) which here might work as an AI assistant to schedule tasks and manage the state of the neural computer.
How do we build a Software 3.0 program?
(This section gets pretty technical and may be skipped by a casual reader)
Once we have assembled all the components of a Neural Computer, we can start thinking of programming it. How do we do this? We can follow the same procedure as a classical computer to compile a high-level programming language (like Python) to Machine code.
Thus, we could build high-level abstractions that can work on arbitrary-length inputs and outputs, and compile them into 4096-token instructions that can be fed to the neural compute and combined with mechanisms like memory retrieval, caching, logic, etc.
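Such a compiler can be sketched in miniature: split an arbitrary-length input into pieces that each fit inside one fixed-size instruction, with room reserved for the output. Token counts are word counts here, and all constants are illustrative assumptions:

```python
INSTRUCTION_TOKENS = 4096
OUTPUT_RESERVE = 1024  # leave room for the model to write its output back

def compile_to_instructions(prompt: str, document: str) -> list[str]:
    """'Compile' an arbitrary-length document into a sequence of
    fixed-size instructions, each carrying the same prompt plus one chunk
    (analogous to a compiler emitting fixed-width machine instructions)."""
    words = document.split()
    budget = INSTRUCTION_TOKENS - OUTPUT_RESERVE - len(prompt.split())
    chunks = [words[i:i + budget] for i in range(0, len(words), budget)]
    return [prompt + "\n\n" + " ".join(c) for c in chunks]

instructions = compile_to_instructions("Summarize:", "word " * 10000)
```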
To build this high-level abstraction, we can define Objects that store natural language.
Natural Language Indices (Objects)
These act as a structured store of arbitrary-length natural language constructs. A natural example is a book: a hierarchy tree whose root is the cover, whose first level corresponds to chapters, the second to pages, the third to paragraphs, and so on. In general, we can define index structures that organize natural language constructs in linear, hierarchical, or any other arrangement. LlamaIndex is an example library that accomplishes this in Software 3.0.
Similarly, we can define Functions to operate on natural language constructs.
Natural language Functions (NLf)
We can think of an NLf as a function or operation that takes an arbitrary-length natural language input and produces an arbitrary-length natural language output. What does an NLf look like? Simple examples: summarize, rephrase, complete. We can also mix in more complex mechanisms such as retrieval; examples may be search, use [tool], etc., and include methodologies such as Toolformer for API calls to external tools. We can also define recursive NLfs, which may include reasoning traces, intermediate chains, validation, and error-correction mechanisms.
Current implementations of such functions are limited to a single instruction within the 4096-token limit, but a high-level implementation could handle arbitrary sizes (and use compilation to create sequences of 4096-token instructions and do fetching & storing as required) to accomplish long-range tasks such as summarizing a complete book or even writing a full novel series.
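A recursive NLf for long-range summarization can be sketched as a map-reduce: summarize chunks that fit in one instruction, then summarize the summaries. The `neural_compute` stub below merely truncates its input; it is a placeholder assumption for a real model call, and the chunk size is illustrative:

```python
CHUNK_WORDS = 3000  # stays comfortably under a 4096-token instruction

def neural_compute(prompt: str, text: str) -> str:
    # Stub standing in for an LLM call: returns at most 50 words.
    return " ".join(text.split()[:50])

def summarize(text: str) -> str:
    """Recursive NLf: handle arbitrary-length input by summarizing chunks,
    then summarizing the concatenated partial summaries."""
    words = text.split()
    if len(words) <= CHUNK_WORDS:
        return neural_compute("Summarize:", text)
    chunks = [" ".join(words[i:i + CHUNK_WORDS])
              for i in range(0, len(words), CHUNK_WORDS)]
    partials = [summarize(c) for c in chunks]   # map step
    return summarize(" ".join(partials))        # reduce step
```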
Once we have built objects & functions for high-level Software 3.0 programming, it’s easy to combine them to form complex Software 3.0 programs.
Ideally, a Software 3.0 program should store states in objects (indexing data structures) instead of the neural compute (GPT) to not risk losing information, and use functions that leverage the neural compute to transform states and reach the desired result.
Compilation vs Interpretation
Instead of compilation from high-level to unstructured natural language 4096-token instructions, we can convert to a structured, and potentially verifiable, intermediate language (see Parsel). These intermediate languages may offer greater verification, safe execution, and portability, allowing the same 3.0 code to run on different neural computes such as Cohere or Anthropic AI models. This may be independent of the specific prompting techniques or mechanisms required for converting the intermediate language to the low-level instructions, similar to how the same Python code can run on different machines.
Moreover, instead of running on neural computes, these intermediate languages can also be made to target arbitrary platforms, such as compiling to a programming language like Python to run on regular compute chips, or CUDA to run on GPUs, or entirely new languages & commands that can interface to diverse systems.
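The portability idea can be sketched as a program in an intermediate form with interchangeable backends. The `Program`/`EchoBackend` protocol below is entirely hypothetical; a real backend would dispatch each instruction to a neural compute such as an OpenAI, Cohere, or Anthropic model:

```python
class Program:
    """A Software 3.0 program as a sequence of intermediate-language
    instructions, runnable on any backend that implements execute()."""
    def __init__(self, steps: list[str]):
        self.steps = steps

    def run(self, backend) -> str:
        state = ""
        for step in self.steps:
            state = backend.execute(step, state)
        return state

class EchoBackend:
    """Trivial stand-in backend: just threads the instructions through
    the state, so we can see the execution order."""
    def execute(self, instruction: str, state: str) -> str:
        return (state + " | " if state else "") + instruction

prog = Program(["fetch the report", "summarize it", "email the summary"])
```

Swapping `EchoBackend` for a different backend would run the same 3.0 code on a different neural compute, much like the same Python code runs on different machines.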
Where can we go with Software 3.0?
In building a neural computer, we have sketched the bare workings of a “brain” that, like its biological counterpart, combines similar mechanisms (long- and short-term memory, neural processing, logic circuits…), though so far only for the language modality!
Thus, building neural computers may be seen as being on the right path to human-level AGI systems.
Where does Software 2.0 fit in?
In this new software paradigm, Software 2.0 fits in as the “fine-tuning” process, whether through supervised learning or RL with human feedback, that adapts & customizes a stock neural compute unit to a particular use-case and dataset.
Thus, the processing power (FLOPs) of the neural compute remains fixed, but this capacity can be adapted to perform better on specific use cases by unlearning irrelevant functionality. This learning & unlearning capability is the biggest benefit of AI-driven neural computing over traditional computing.
Benefits of porting to Software 3.0
Why should we port complex programs into Software 3.0? A big reason is we can make use of the innate properties of natural language. This makes the software not only easier to write for us but also allows it to generalize better across different tasks and domains. Some concrete pros & cons:
Pros:
Human Interpretability. Software 3.0 is human-interpretable and simple to write, as its inputs & outputs are natural language rather than arrays of numbers, offering greater transparency and interpretability than Software 2.0. Moreover, in Software 3.0, the AI model remains fixed and can be communicated with over APIs, and its training & implementation details are mostly irrelevant to the user.
Generalization & Simplicity. Software 3.0 can capture the functional long tail that has been hard to reach in earlier paradigms. In traditional 1.0 code, capturing all possible functionality for a simple use case such as identifying an object in an image can involve 1000 lines of if-else conditions; in 2.0 code, it means collecting a dataset with 1000s of images and training a neural network; whereas in 3.0 code, it can be accomplished with a single prompt to a large AI model!
Framework Agnostic. A big benefit of software 3.0 is unifying a bunch of underlying frameworks (e.g. programming languages) under a single interface of plain English.
No Gradient-based optimization. Compared to Software 2.0, where training a single neural network can take 100s of epochs and multiple days of optimizing the network weights with backpropagation and gradient descent, 3.0 code is free of gradient-based optimization, requiring only inference calls to a large AI model.
Lower barriers. Software 3.0 can be operated by a broader range of users, regardless of their technical expertise, making it a more accessible and versatile solution in the ever-evolving world of software development, thus leading to saving overall engineering time and budget. Programming in Software 3.0 is almost a linguistic exercise of writing concise & well-specified English, and a newer generation of programmers might be more linguists rather than classical coders.
Cons:
Fault tolerance. It’s currently an open question on how fault-tolerant & reliable Software 3.0 code can be. A general concern with using any black-box AI model is errors, and perhaps it will be even more relevant if we're treating Natural Language + Large Language Models (LLMs) as the new Software abstraction; the greater the number of calls to an LLM, the greater the probability of error within a program. There may be some use cases where this is totally acceptable or can be mitigated. And this provokes the question of what may be some good automated error correction mechanisms in Software 3.0.
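One candidate automated error-correction mechanism is validate-and-retry: check the model's output against a verifier and, on failure, re-prompt with feedback appended. The flaky model stub and validator below are illustrative assumptions:

```python
def call_with_retries(model, prompt: str, validate, max_tries: int = 3) -> str:
    """Query the model; if the output fails validation, append feedback
    to the prompt and retry, up to max_tries attempts."""
    for _ in range(max_tries):
        output = model(prompt)
        if validate(output):
            return output
        prompt += f"\nThe previous answer ({output!r}) was invalid; try again."
    raise RuntimeError("no valid output after retries")

# Stub model that fails twice before producing a valid answer:
answers = iter(["maybe", "dunno", "42"])
flaky_model = lambda prompt: next(answers)
result = call_with_retries(flaky_model, "What is 6 * 7?",
                           validate=lambda s: s.isdigit())
```

Note the cost trade-off: each retry is another LLM call, so error correction itself adds to the latency and cost discussed below.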
Limitations to Expertise. Prompting a pre-trained model is likely not sufficient to reach expert-level domain expertise; fine-tuning along the lines of Software 2.0 may prove mission-critical to certain applications and become more prominent in the future.
Prompting finickiness. Software 3.0 program behavior can easily & unpredictably change due to a slight word placement or grammar change. With great power comes great responsibility, and with a single grammatical mistake, a software system that uses Software 3.0 could entirely malfunction.
Latency and Cost. Another potential "con" of Software 3.0 is the latency and cost. Assuming we access LLMs through a remote API and assuming context windows get bigger (e.g. to >=50k tokens), users may need to think carefully about stuffing the context window full of tokens vs. being more mindful of the potential latency/cost implications. Writing any software program requires users to be mindful of system resources, such as CPU/RAM/disk space, and this is similar in that regard; just maybe with a slightly different set of concerns.
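The trade-off scales roughly linearly with tokens, which a back-of-the-envelope estimate makes vivid. The per-token price below is a made-up placeholder, not any provider's actual rate:

```python
PRICE_PER_1K_TOKENS = 0.002  # illustrative placeholder, not a real rate

def estimate_cost(prompt_tokens: int, output_tokens: int, calls: int) -> float:
    """Total cost in dollars for a workload of identical calls."""
    return (prompt_tokens + output_tokens) * calls * PRICE_PER_1K_TOKENS / 1000

# Stuffing a 50k-token context vs. retrieving a lean 2k-token context,
# over 1,000 calls:
full = estimate_cost(50_000, 500, 1_000)
lean = estimate_cost(2_000, 500, 1_000)
```

At these placeholder numbers, the stuffed context costs roughly 20x the lean one — which is why retrieval-based context selection matters even when the window is large.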
What does the future look like?
Software 3.0 is poised to transform the way we build and interact with technology. At the heart of this revolution lies AI-driven development to enable seamless, human-like software interactions. A single software component will likely comprise an AI interface with an internal program state, and natural language will be the connective tissue for software stacks, databases, frontends, backends, and API calls.
Humans will communicate with AI using natural language and AI will operate machines allowing for more intuitive and efficient operations. We are increasingly seeing this shift for web browsers with MULTI·ON and Adept AI automating browsing using user natural language input. In robotics, Google Robotics is leading the charge in creating robots that understand and respond to human language. Similarly, AI-driven synthesis is being harnessed across various creative domains. Music, video, games, and movies are all experiencing a renaissance as AI becomes capable of generating complex and engaging content in response to human input, and the line between human creativity and machine-generated artistry is becoming increasingly blurred.
Recognizing Software 3.0 as an emerging programming paradigm, instead of simply treating large AI models like GPT as powerful text-generating tools, the extrapolations become more obvious, and it’s clear that there is much more work to do.
Current tools assist humans in writing 1.0 and 2.0 code, such as IDEs with syntax highlighting, debuggers, profilers, and code and dataset versioning. In the 3.0 stack, programming involves composing and refining natural language prompts. When an AI model produces undesired output, the issue is fixed by enhancing the prompt or context. The first Software 3.0 IDEs may help with crafting, testing, and refining prompts, possibly suggesting alternatives or providing real-time feedback on clarity and effectiveness.
Similarly, GitHub & Huggingface are very successful homes for Software 1.0 and 2.0 code respectively. Is there space for a Software 3.0 Github? In this case, repositories could store collections of prompts and model configurations, and commits consist of refinements and expansions of the prompt library.
Mature libraries like Pytorch for model training and services like Scale AI for data collection and labeling have facilitated Software 2.0 coding. It remains to be seen if upcoming libraries like Langchain can bridge this gap, and what additional services can be developed to ease Software 3.0 coding.
For deployment and user interaction with Software 3.0, it's unlikely that the final product will be a single chat interface. We will see specialized copilots, agents, and assistants for various cognitive tasks emerge to reduce the human workload.
In the near term, Software 3.0 will become increasingly prevalent in every domain where a 99% model accuracy is not mission-critical and designing an explicit algorithm is challenging. This presents numerous exciting opportunities in considering the entire software development ecosystem and how it can be adapted to this new programming paradigm. In the long term, the future is really promising: as we approach AGI, Software 3.0 will be at the forefront, with even fictional AI systems like JARVIS potentially within reach.
Some predictions:
We will develop a universal token language for Software 3.0 (the new x86): This will be the new instruction set architecture that will define AI-driven computing. It will be:
Multimodal: mixes any modality tokens + boolean logic tokens + other auxiliary tokens
Incorporate Retrieval and tooling: similar to read & write to memory in CPUs
Include Reasoning & Interpretability traces
OpenAI becomes the “neural compute” maker (Intel) and ships a new GPT about every year
Final remarks (after the cutoff date):
GPT-4, with a 32K token length and multimodal inputs, increasingly looks like a step towards a universal token language. The ChatGPT plugin ecosystem adds retrieval and tooling mechanisms through API usage, and with more maturity, it will likely become a standard architecture for AI model interaction, similar to the x86 architecture in traditional computing.
Even if the GPT ecosystem can be seen as the x86, there is room for other architectures similar to ARM that require smaller context length and may be faster & more efficient.
The blog post compares OpenAI to Intel as a “neural compute” maker, and it remains to be seen if they become more like Apple and build a full “neural computer” and an App ecosystem around it.
Feel free to reach out if this post is striking to you :)
Thanks to Jerry Liu, Demren Sinik, Catherine Wu, Brad Porter, and Noah Goodman for their comments and feedback on the post.
Please share & subscribe to the substack, and I will be posting more of my musings regularly.