Artificially Intelligent Agents and Companies of the Future

Colin Yuan
Dec 12, 2024


ABSTRACT

As Artificial Intelligence (AI) researchers and company executives increasingly expect Artificial General Intelligence (AGI) to arrive as soon as 2027, AI agents promise to revolutionize how companies of the future will be run. Specifically, as Large Language Models (LLMs) gain longer context windows, close the test-time compute overhang, and learn to reason in a general and scalable way, agentic behavior capable of accomplishing general tasks will become possible. This could lead to a workforce of remote agents that operate continuously at optimal energy and focus, without the slowdowns caused by less efficient team members. However, this transformation does not come without challenges. Immediate investments in making models more efficient, without simply adding compute, can improve them by several orders of magnitude; examples include machine learning methods like Reinforcement Learning from Human Feedback (RLHF) and Chain-of-Thought (CoT) prompting. AI agent training, such as Microsoft's multi-agent framework AutoGen or the Self-Taught Reasoner (STaR) method that improves LLM outputs by teaching models to "think before they speak," is also an indispensable area of investment to improve output usefulness and accuracy. Lastly, significant investments in power generation and compute are necessary to support the industrial infrastructure for AGI, Artificial Superintelligence (ASI), and beyond. But for venture capitalists, the most attractive deals may come from investing in the right companies in the application or "picks and shovels" layers of AI to capitalize on an agentic future.

INTRODUCTION (Order-of-Magnitude (OOM) Improvements in AI from 2019 to 2024)

As humans, we often struggle to grasp exponential growth. A 2015 essay by the popular writer Tim Urban illustrates this "Law of Accelerating Returns" particularly well. He asks us to imagine a time machine that brings someone from 1750 to 2015. It would be a near-death experience for the 1750 visitor to witness magical bricks that communicate with people on other continents and shiny capsules that race by on a highway. But if that same 1750 person used the machine to bring someone from 1500 to his own world (roughly the same 250-year jump), the visitor from 1500 would be far less shocked. Sure, he would learn some otherworldly facts about space and physics, and about Europe's new imperialism fad, but everyday life (communication, transportation, and so on) would not impress him nearly as much as 2015 impressed the person from 1750. To produce that same near-death level of shock, the 1750 person would have to reach much further back and retrieve someone from around 12,000 BC, before the First Agricultural Revolution gave rise to cities and to the concept of civilization. It takes someone from the hunter-gatherer age to be floored by sprawling empires filled with towering churches, ocean-crossing ships, and the enormous store of human knowledge and discovery accumulated by 1750. Urban's point is this: the average rate of advancement in any given period is faster than in the period before it, because each period starts from a more advanced world. Given our short lifespans, everyday linear experiences, and evolutionary focus on immediate threats and rewards, we are often caught off guard when new technologies utterly transform our societies and everyday lives.

Advancements in AI follow the same Law of Accelerating Returns, which is why the capabilities of OpenAI's GPT-4 came as a shock to many. In just four years, the jump from GPT-2 to GPT-4 took us from roughly preschooler capabilities to those of a very smart high schooler. Consider the MATH benchmark, a set of difficult problems drawn from competitions including the AMC 10, AMC 12, and AIME. When the benchmark was released in 2021, GPT-3 solved only about 5% of the problems. Within a year, the best models went from ~5% to 50% accuracy by mid-2022, and with the release of Google's Gemini 1.5 they exceed 90% today. Similarly, across other subjects, GPT-4 dramatically outperforms the previous-generation GPT-3.5 on the SAT, LSAT, GRE, AP Calculus BC, and more, as illustrated in the figure below.

While AI models have yet to beat humans on the hardest unsolved benchmarks, such as GPQA, a set of PhD-level questions across scientific subjects written by PhDs, with answers designed to be unsearchable online, GPT-4 already surpasses the level of a "highly skilled non-expert" at 39% accuracy (compared with 34% for non-experts and 65% for PhD domain experts). And if we continue to improve compute and algorithmic efficiency, we can surpass current model capabilities by several orders of magnitude (Situational Awareness, p. 19).

AGENTIC BEHAVIOR

With massive scale-ups in compute capabilities driven by behemoth investments across foundation providers, and significant progress in algorithmic efficiencies that dramatically lower costs for equivalent performance, agentic behavior is expected to become the next iteration of how we interact with LLMs and AI in general. I will explain how current techniques in model improvement, assuming future improvements in models’ raw capabilities, will naturally produce AI that looks more like an agent or co-worker rather than a chatbot.

Current model constraints

  1. Compute overhang
  2. Limited tools
  3. Lack of critical thought

Compute Overhang

Currently, models can only accomplish very simple tasks despite being very powerful. This is akin to a scientist who can only work on a difficult problem for five minutes, or a software engineer who can only write skeleton code for a single function: they cannot make breakthroughs or be very useful. But what if, instead of working on a problem for just five minutes, a model could understand a large task, make a plan, iterate, and execute over the course of days, weeks, or even months?

(Table from Situational Awareness, p. 35, Leopold Aschenbrenner)

The table above shows that current models can only use on the order of hundreds of tokens coherently before errors accumulate and their chains of thought fall apart. Even though model providers have increased the context windows of their chatbots, the larger windows mainly help with the consumption of tokens (taking in longer questions, papers, and source files), not the production of tokens (generating long outputs like documents or code). If we can unlock the models' substantial untapped capacity to generate outputs (the test-time compute overhang), we can imagine models using millions of tokens at a time to do the work that, say, a consultant or banker does over the course of a few weeks on a project or M&A deal.

Limited Tools

Beyond back-and-forth conversation, today's chatbots are largely unhelpful for the more complex tasks we perform daily, such as sending emails, joining meetings, and using various apps and developer tools. This is because they are not equipped with the tools needed to navigate and operate these applications.

Some companies have started to make AI more relevant and helpful to our daily lives. Rabbit, a startup building AI-native devices, has developed a proprietary Large Action Model (LAM) that lets its assistant take actions on behalf of the user by request. Users can train their "rabbit" to carry out frequent daily tasks like ordering takeout or calling an Uber. But the approach is ultimately hard to sustain: the self-training step makes the assistant's capabilities difficult to scale, and its accuracy on more complex UI interactions is not guaranteed, potentially leading to unwanted actions.

Another company making strides is Exa, which is making the web more interpretable for AI agents through embeddings-based search. Traditional search engines rely on lexical search, where the engine tries to match keywords in your query to keywords in documents. This is fast and works well enough for most structured data. But to search effectively through unstructured data, like images and graphs, search needs to go beyond simple keywords and capture intent and contextual meaning. To do this, Exa converts large volumes of text into arrays of numbers called vectors, which encode the meaning of the text in a form computers can easily work with. The search engine then ranks results by how close each document's vector is to the query's vector.
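
To make the idea concrete, here is a minimal sketch of how embeddings-based search ranks documents. The `embed` function below is a toy stand-in (a simple word-hashing trick, not a trained encoder) used only to show the mechanics of vector comparison; a real system like Exa would use a neural embedding model instead.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for an embedding model: hashes words into a fixed-size,
    unit-length vector. A trained encoder would capture meaning, not just words."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def search(query: str, documents: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank documents by cosine similarity between the query vector and each document vector."""
    q = embed(query)
    doc_vecs = np.stack([embed(d) for d in documents])
    scores = doc_vecs @ q  # dot product of unit vectors = cosine similarity
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), documents[i]) for i in ranked]

docs = [
    "How to order takeout through a delivery app",
    "A guide to booking rideshares from your phone",
    "Quarterly earnings report for a mining company",
]
# With this toy encoder, only overlapping words score; a trained encoder would also
# rank "get dinner delivered" near the takeout document despite sharing no words.
print(search("order takeout delivery", docs))
```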

With powerful tools optimized for AI agents, the chatbots of the future will look a lot more like drop-in remote workers: onboarded like new hires, messaging you and your colleagues on Slack, joining meetings, and using the same software that you do.

Lack of Critical Thinking

The gap between what chatbots can do today and what AI could be capable of in the future maps onto the idea of "System 1 vs. System 2 thinking" in human psychology. These two systems of thought were characterized by the Nobel laureate and psychologist Daniel Kahneman in his 2011 book Thinking, Fast and Slow. System 1 thoughts are largely automatic. Imagine taking the train to work on a route you have ridden for the last two years: it takes only a small amount of cognitive effort to get on and off at the right stops, and those actions happen almost unconsciously. By contrast, System 2 thoughts are controlled. Imagine now that your usual train has broken down, and you must weigh your options between taking the bus, ordering an Uber, or, if the weather allows, walking to work. These thoughts, elicited intentionally and requiring considerably more cognitive resources, are typical of System 2 thinking.

Looking at how current AI models are structured and where they fall short, much of the improvement will come from teaching chatbots how to reason through difficult, long-horizon projects in a "System 2" way. Our interactions with chatbots today are spontaneous, unrevised, and more often than not generic; even after a few more tries, the chatbot produces comparable answers. By continuing to apply reinforcement learning techniques and improving compute and usable token length, we might achieve a crucial unlock in chatbots' capabilities: producing tokens that have been examined critically. One day, we might imagine a stream of millions of words coming through as a model uses tools, does research, communicates with other agents, tries different approaches, revises its work, and completes big projects on its own.

AI AGENT TRAINING TECHNIQUES

To create intelligent, skilled, and independent AI agents that can assist us with the most difficult tasks, we need to resolve the issues around compute overhang, available tools, and critical thinking. To that end, we must look at techniques at the frontier of machine learning and agent training platforms.

Reinforcement Learning from Human Feedback (RLHF)

As you scale up the model size (number of parameters), the amount of data, and the compute used for training, the model’s performance improves in a predictable way. This relationship follows a power-law, meaning that as you double the model size or dataset size, you can predict how much better the model will perform. However, there is a point of diminishing returns, where simply adding more data or compute does not lead to proportional improvements in performance.
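
For illustration, a power law of this kind can be written as L(N) = (N_c / N)^α, where N is the parameter count and L the predicted loss. The constants in the sketch below roughly follow those reported for parameter-count scaling by Kaplan et al. (2020) and should be read as placeholders rather than exact values.

```python
def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Illustrative power-law scaling curve of the form L(N) = (N_c / N)^alpha.
    The constants loosely follow Kaplan et al. (2020); treat them as placeholders."""
    return (n_c / n_params) ** alpha

# Doubling model size shrinks the predicted loss by a fixed ratio (roughly 2**-alpha),
# which is what "power law" means in practice: predictable but diminishing gains.
for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```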

This is where machine learning techniques like reinforcement learning from human feedback (RLHF) come in to drive additional performance. Optimizing a model based on human feedback is useful when a task is hard to specify but easy to evaluate. For instance, training a model to generate text that is helpful and free of harmful content, such as bias or toxicity, would be difficult if we had to create the dataset manually; crafting hundreds of thousands of bias-free examples would be slow and complex. Humans, however, excel at quickly assessing and comparing the harmfulness of AI-generated text. Consequently, a more practical approach is to leverage human feedback to refine and improve the model's text generation. A group of scientists at OpenAI demonstrated the power of RLHF in a 2022 study, where the smaller, RLHF-trained model InstructGPT (with 100x fewer parameters) outperformed the larger, non-RLHF model in human grader scores. Although InstructGPT still makes simple mistakes, the findings indicate that fine-tuning with human feedback is a promising approach for aligning language models with human intent.
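
As a rough illustration of the machinery involved, the snippet below sketches the pairwise preference loss typically used to train the reward model at the heart of RLHF. The scores and function names are illustrative, not OpenAI's implementation.

```python
import numpy as np

def reward_model_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Pairwise (Bradley-Terry style) loss commonly used for RLHF reward models:
    the scalar reward of the human-preferred completion should exceed the reward
    of the rejected one. Inputs are reward-model scores per labeled comparison."""
    margin = r_chosen - r_rejected
    return float(np.mean(-np.log(1.0 / (1.0 + np.exp(-margin)))))  # -log(sigmoid(margin))

# Hypothetical scores for three comparisons where labelers preferred completion A over B.
scores_chosen = np.array([1.2, 0.3, 2.0])
scores_rejected = np.array([0.4, 0.9, 1.5])
print(reward_model_loss(scores_chosen, scores_rejected))
# The fitted reward model is then used as the training signal in a policy-optimization
# step (e.g., PPO) that fine-tunes the language model itself.
```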

Chain of Thought (CoT): Self-Taught Reasoner (STaR) and Quiet-STaR

Traditional language models can generate text that appears coherent, but they often struggle with tasks requiring deep reasoning, such as answering complex questions or making logical inferences. This limitation arises because standard training methods do not explicitly teach the model to reason through problems step by step with a "chain of thought." Instead, these models often rely on surface-level patterns in the data, which can lead to incorrect or superficial answers. A common fix is to actively prompt the model to "think out loud," at which point it will often notice its mistake and reattempt the problem. But this method is time-consuming and does not scale. The Self-Taught Reasoner (STaR) technique was developed to address this issue by enabling language models to learn from their own reasoning processes, iteratively improving their ability to solve complex tasks.

STaR (introduced by Stanford and Google Research teams in 2022) works by having the model generate rationales, meaning explanations or reasoning steps, during training. The technique is a straightforward iterative loop: first, generate rationales to answer many questions, using a few worked examples as prompts. Next, if a generated answer is incorrect, regenerate the rationale with the correct answer provided as a hint. Finally, fine-tune the model on all rationales that led to correct answers, and repeat. While a model trained with STaR performed comparably to a fine-tuned, 30x larger state-of-the-art language model on CommonsenseQA, it inherently covers just a subset of reasoning tasks because it was trained on high-quality, carefully curated question-answering (QA) datasets. This restricts the diversity and generalizability of the reasoning skills it can develop.
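
A minimal sketch of that loop, with `generate` and `finetune` standing in for real model-inference and training calls, might look like this:

```python
from typing import Callable, Iterable, Tuple

def star_iteration(
    generate: Callable[..., Tuple[str, str]],   # (question, hint=None) -> (rationale, answer)
    finetune: Callable[[list], None],           # fine-tunes on (question, rationale, answer) triples
    dataset: Iterable[Tuple[str, str]],         # (question, correct_answer) pairs
) -> list:
    """One round of the STaR loop. The callables are placeholders for real model
    and training calls; the control flow is the point of the sketch."""
    kept = []
    for question, correct_answer in dataset:
        rationale, answer = generate(question)  # try to reason to an answer
        if answer != correct_answer:
            # "Rationalization" pass: retry with the correct answer given as a hint.
            rationale, answer = generate(question, hint=correct_answer)
        if answer == correct_answer:
            kept.append((question, rationale, correct_answer))  # keep only successful rationales
    finetune(kept)  # train on the kept rationales, then repeat with the improved model
    return kept
```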

To improve generalizability, a group of researchers at Stanford developed Quiet-STaR in 2024, which builds on the foundation laid by STaR but trains beyond the usual QA tasks on a large internet corpus. This allows the model to develop reasoning skills applicable to a much wider variety of tasks. Specifically, the model is trained to 1) Think: generate several possible thoughts in parallel after each part of a sentence (see "sampled thought"); 2) Talk: compare what it predicts will come next with and without these thoughts; and 3) Learn: check which thoughts helped it make a better prediction (Quiet-STaR, p. 2), rewarding and keeping the helpful thoughts and discarding the ones that didn't help, which improves its reasoning on future tasks. Instead of asking the model to think "out loud" with a traditional CoT approach, Quiet-STaR lets the model think "quietly" at every token, generating explicit CoT reasoning without in-context examples, which yields more structured and coherent chains of thought.

This figure from the original research paper illustrates how Quiet-STaR works by thinking, talking, and learning.

Additionally, Quiet-STaR introduces new techniques, such as tokenwise parallel sampling (generating multiple rationales at each token) and custom meta-tokens (special learned tokens that mark the start and end of a thought), to efficiently manage the computational cost of generating rationales at every token.
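
A highly simplified sketch of the think/talk/learn idea, leaving out the learned mixing head and the start/end-of-thought tokens, might look like the following; the callables stand in for the real parallel forward passes of the model.

```python
import numpy as np
from typing import Callable, Sequence

def quiet_star_rewards(
    logp_base: Callable[[int], float],               # log p(true next token | context), no thought
    logp_with_thought: Callable[[int, str], float],  # same, conditioned on one sampled thought
    sample_thoughts: Callable[[int, int], list],     # sample k candidate thoughts after position i
    tokens: Sequence[str],
    k: int = 4,
) -> list:
    """Sketch of Quiet-STaR's loop: at each position the model samples thoughts in
    parallel, measures how much each thought improves its prediction of the text
    that actually follows, and uses that improvement (centered per position) as a
    REINFORCE-style reward for the thought-generation behavior."""
    all_rewards = []
    for i in range(len(tokens) - 1):
        thoughts = sample_thoughts(i, k)  # Think: k thoughts in parallel
        gains = np.array([logp_with_thought(i, t) - logp_base(i) for t in thoughts])  # Talk
        rewards = gains - gains.mean()    # Learn: reward above-average thoughts, penalize the rest
        all_rewards.append(list(zip(thoughts, rewards)))
    return all_rewards
```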

This figure is taken from the original research paper to demonstrate models’ results after being trained on Quiet-STaR. The left plot (a) shows the zero-shot accuracy on GSM8K, while the right plot (b) shows the zero-shot accuracy on CommonsenseQA, without any fine-tuning. The x-axis represents training steps, and the y-axis measures the zero-shot direct accuracy on the respective datasets. Each line on the graph corresponds to a different number of thinking tokens used during Quiet-STaR training.

The group's results showed that on the CommonsenseQA dataset (questions that require common-sense reasoning), Quiet-STaR improves zero-shot performance (answering without any task-specific examples or fine-tuning) by 10.9% compared to the base language model. It also achieved a 5% improvement on GSM8K (a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems), ultimately making it a more capable tool for aligning language models with human-like reasoning (Quiet-STaR, p. 8).

AutoGen

As the tasks that benefit from LLMs become increasingly complex, a single agent often struggles to handle all aspects, such as reasoning, tool usage, and adapting to new information. Traditionally, LLM applications were limited by the lack of coordination between agents, leading to inefficiencies and limited scalability. AutoGen, a tool developed by Microsoft, aims to solve this problem by enabling the creation of multi-agent systems where agents can work together, share knowledge, and perform tasks in a more modular and efficient manner.

AutoGen introduces a framework that allows developers to create and customize “conversable agents” — agents that can communicate with each other, take actions, and solve tasks collaboratively. These agents can be backed by LLMs, tools, human inputs, or a combination of these, making them highly versatile. AutoGen simplifies the development process through “conversation programming,” where developers define how agents interact using natural language and programming languages. This approach streamlines the workflow by breaking down complex tasks into manageable sub-tasks, which different agents can handle independently or in collaboration with others. The framework also supports flexible conversation patterns, enabling agents to adapt to different scenarios, whether they require static, pre-defined interactions or dynamic, real-time decision-making.
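
The canonical two-agent pattern from AutoGen's documentation gives a feel for "conversation programming." The configuration details below (model name, API key, working directory) are assumptions and may differ across library versions.

```python
from autogen import AssistantAgent, UserProxyAgent

# llm_config points at whichever model endpoint you use; the key names follow
# AutoGen's documented config format but may vary between versions.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                       # fully automated back-and-forth
    code_execution_config={"work_dir": "scratch"},  # the proxy executes code the assistant writes
)

# The two conversable agents collaborate: the assistant plans and writes code,
# the user proxy runs it and reports results back until the task is complete.
user_proxy.initiate_chat(
    assistant,
    message="Plot NVDA's closing price for the last month and save it to nvda.png",
)
```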

This figure, taken from the original research paper, shows different kinds of use cases that AutoGen enables.

Platforms such as AutoGen have paved the way for a wave of new startups focused on fully automated agent solutions for enterprises, like Decagon.ai, which provides enterprise-grade generative AI for customer support. Other studies of agentic behavior, such as one from Stanford University that simulated a community of 25 unique agents, each with its own identity, memory, behaviors, and ability to act, show that generative agents can produce highly believable behavior, both individually and as part of a social group.

This figure, taken from the original research paper, shows the Sims-like experiment the researchers ran.

These findings have significant implications, promising advancements not only in enterprise and consumer markets but also in the social sciences, by harnessing the power of advanced AI agents.

AREAS OF KEY INVESTMENTS

Today’s AI technology (notably set off by the arrival of ChatGPT) has begun to significantly impact corporate productivity and cost efficiency, even though fully autonomous AI agents capable of handling complex, remote projects are not yet a reality. For example, ServiceNow has implemented “Now Assist”, an AI-driven tool that has achieved a case avoidance rate of nearly 20%, demonstrating the technology’s potential to streamline customer service operations and reduce the need for human intervention. Similarly, Palo Alto Networks has utilized AI to cut the costs associated with processing expenses, while HubSpot has successfully scaled its customer support capabilities through AI integration. Perhaps most striking is Klarna, which has reported over $40 million in run-rate savings by embedding AI into its customer support systems, illustrating the substantial financial benefits AI can bring when strategically applied.

These examples underscore a broader trend: AI is ushering in a productivity revolution that is more profound than previous communication revolutions driven by the internet and mobile phones. While these earlier technologies primarily facilitated faster and more efficient communication, AI is akin to the personal computer (PC) in that it is reshaping the very foundations of business and industry. Research from McKinsey supports this perspective, estimating that AI could boost global corporate profits by $2.6 trillion to $4.4 trillion annually. This surge in profitability is expected to be driven by AI’s capacity to enhance productivity and efficiency across various sectors, including banking, retail, and life sciences. As AI evolves from basic text or code generation to more sophisticated agentic interactions, it will continue to drive significant changes in how businesses operate.

In order to reap the rewards that advanced AI promises, we need the underlying infrastructure to support it. The rise of the PC and smartphone sparked demand for greater internet bandwidth to carry data. Similarly, the evolution of AI agents will drive the need for more powerful computing infrastructure and faster communication between systems. However, these infrastructure improvements will require substantial investment. As AI revenue grows rapidly, potentially reaching trillions of dollars by the end of the decade, there will be an intense push toward expanding GPU capacity, building more advanced data centers, and increasing power production. This industrial mobilization will likely require the full participation of both governments and large corporations, as growing U.S. electricity production by significant percentages to support AI's infrastructure demands will be a monumental task.

For private investment firms, this landscape presents unique opportunities. The focus should be on investing in either the “picks and shovels” that enable AI agents or companies with strong value propositions in the application layer. I present several case studies of companies to explain why they exemplify areas that venture firms should be broadly investing in:

AI Infrastructure Case Study: Etched.ai

Etched AI is a compelling example of the type of infrastructure investment VC firms should be making in the AI space, particularly in compute, due to the escalating demands and costs associated with running large language models (LLMs) like ChatGPT.

To begin with, the ongoing inference required to run models like ChatGPT is immensely expensive, with projections suggesting that OpenAI alone might spend $400 million annually on compute. This cost is driven by the need for powerful hardware like NVIDIA’s DGX A100 systems, which, while effective, are not optimized specifically for LLM workloads and come with a hefty price tag of around $300,000 per unit. When companies like Microsoft aim to integrate such AI models into mainstream products like Bing, they face infrastructure costs that could reach up to $4 billion, underlining the unsustainable nature of current compute expenses for AI.

Many startups in the generative AI space are currently operating at a loss because their compute bills are prohibitively high. As the adoption of LLMs increases, the demand for compute power will multiply exponentially, putting even more strain on companies already at a breaking point. In this environment, there’s a pressing need for more efficient, cost-effective solutions to keep pace with the growing demand for AI-powered applications.

Source: Etched.ai

Etched AI is uniquely positioned to address this challenge. Their approach focuses on developing specialized hardware optimized specifically for LLM workloads, rather than relying on general-purpose AI accelerators like GPUs or TPUs. These traditional chips, while powerful, are not optimized for LLMs, leading to significant inefficiencies and higher costs. Etched AI’s hardware, by contrast, is designed to maximize performance for LLMs, delivering over 100x the performance of similarly priced GPU clusters. This specialization enables them to provide an order of magnitude more throughput and 20x lower latencies compared to traditional systems.

Moreover, Etched AI’s system is drop-in compatible with popular Transformer libraries and integrates seamlessly with ecosystems like Hugging Face and NVIDIA’s TransformerEngine, making it a practical and accessible solution for companies already entrenched in existing AI frameworks.

Sohu, Etched’s first Transformer chip, claims to be 20x faster than NVIDIA H100s; Source: etched.ai

Etched AI’s initial focus on serving smaller, emerging cloud providers — outside of the major players like AWS, GCP, and Azure — further underscores its potential impact. These providers, such as Lambda Labs, Runpod, and Coreweave, cater to startups and early-stage companies that are often constrained by access to GPUs and the high costs associated with AI infrastructure. By offering a more efficient and cost-effective solution, Etched AI enables these companies to scale their AI applications without the prohibitive expenses that have historically limited their growth.

AI Picks and Shovels Case Study: LangChain

While language models are unlocking new types of high-value applications (e.g., ChatGPT), it is still non-trivial to create and maintain these applications, particularly in production settings. The obstacles include the models' complexity, the expertise needed to fine-tune them on custom datasets, and the scalability and performance demands of high-value applications. LangChain makes it easier for developers to build enterprise-grade applications on top of language models via a high-level interface. Developers can assemble chains themselves or select pre-built chains that accomplish a variety of tasks. In this way, a language model becomes more powerful because it is connected to sources of data and able to interact with its environment.
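
For a feel of that high-level interface, here is a minimal chain composed with LangChain's pipe syntax. The model name and ticket text are placeholders, and provider credentials are assumed to be configured in the environment.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# A minimal chain: prompt -> model -> output parser, composed with the pipe operator.
prompt = ChatPromptTemplate.from_template(
    "Summarize the following support ticket in two sentences:\n\n{ticket}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # swap in your provider/model
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"ticket": "My rewards points from last month's rent never posted..."}))
```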

One case where LangChain provided immense value was its client New Computer, which was developing Dot, a personal AI agent designed to truly understand its users. As a personal AI, Dot needed to retain users' information long-term, learn their preferences over time, and retrieve that information quickly. To do this, the New Computer team had to structure information at memory-creation time so that retrieval would later be accurate and efficient; a standard Retrieval-Augmented Generation (RAG) setup would not have sufficed. Using LangChain's LangSmith product, the team labeled the relevant memories for each query and defined evaluation metrics, allowing them to iterate quickly on retrieval for the agentic memory system. They then ran multiple experiments in LangSmith's SDK and experiment environment to execute, evaluate, and inspect results. These experiments enabled New Computer to significantly improve its memory system, achieving 50% higher recall and 40% higher precision compared to a previous baseline implementation of dynamic memory retrieval based on regular semantic search.
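
The evaluation itself boils down to straightforward retrieval metrics: for each labeled query, compare what the memory system retrieved against the memories humans marked as relevant. A hypothetical example (not New Computer's actual data) looks like this:

```python
def retrieval_metrics(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Recall: share of labeled-relevant memories the system actually retrieved.
    Precision: share of retrieved memories that were labeled relevant."""
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# One labeled query from a hypothetical evaluation set.
retrieved = {"prefers vegetarian food", "works night shifts", "lives in Austin"}
relevant = {"prefers vegetarian food", "allergic to peanuts", "works night shifts"}
print(retrieval_metrics(retrieved, relevant))  # approximately (0.67, 0.67)
```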

Quality platforms like LangChain are required to build quality AI applications, so investing in the "picks and shovels" will almost certainly create new possibilities for founders building the next generation of exceptional AI applications.

AI Application Case Study (1): Decagon.ai

The most successful AI application founders are those who can go deep technically while also understanding how to solve real human problems. Customer service is one of the most persistent challenges businesses face: users' requests are diverse in nature, and solutions often require out-of-the-box thinking. When AI solutions are implemented to address this challenge, they often fail to boost efficiency because, once the AI agent exhausts its decision tree, human intervention is required to resolve the issue. As a result, many AI products intended to enhance customer service efficiency end up being counterproductive for employees, leading to diminished ROI and the eventual abandonment of these AI projects.

One of the truly successful cases of AI implementation in customer service is Decagon.ai. Decagon leaps beyond the simple decision-tree bot: it interprets users' requests, checks them against company rules and policies, and resolves them, all without human action, complex decision trees, or canned responses. Its fully automated solution dramatically increases the speed and efficiency of ticket resolution while raising customer satisfaction (CSAT). One example of its effectiveness is Bilt Rewards, a technology company that lets renters earn points on rent, which used Decagon to reduce customer service complexity and increase satisfaction. Before implementing Decagon, Bilt struggled with a popular incumbent solution that could not handle the complexity of its business logic and data, leading to significant overhead in maintaining decision trees and rules-based systems; as the product evolved, that burden only grew. The customer experience was also poor, with bots unable to resolve simple tickets due to integration issues, resulting in impersonal and ineffective responses. This lack of effective tools turned the support operation into a cost center rather than a growth driver.

Decagon's solution helped Bilt resolve thousands of support tickets and interactions automatically month over month by leveraging large language models that understand complex business logic and integrate with Bilt's internal data. This led to a significant increase in both CSAT and Net Promoter Score (NPS). Ultimately, Decagon's AI agents exemplify the progress companies across the AI stack have made: agents can now be "onboarded" to learn a company's rules and policies, evaluate requests with a "System 2" approach, and take actions on behalf of the company, all while raising customer satisfaction. With the right implementation, AI solutions can be much cheaper, faster, and more efficient at labor-intensive and repetitive tasks previously done by humans.

AI Application Case Study (2): Stratum.ai

AI technologies are so versatile that they can even be applied to an industry as traditional as mining. The mining industry is crucial for global energy and technology, supporting companies like Tesla and Nvidia. However, it faces significant challenges as only two minerals, coal and gold, exceed 2019 prices in real terms, and production costs have surged by 30% over the last five years. Declining ore grades and deeper deposits have driven costs higher, leading to a 7% drop in revenues and a 44% fall in profits in 2023. With further declines expected in 2024, PwC’s 2024 Mine report urges miners to invest in growth and transformation despite these pressures.

Stratum AI is a mine planning and grade control platform powered by machine learning. By utilizing deep learning, it develops 3D models that precisely pinpoint the location of high-grade ore deposits within mines, enabling miners to maximize profitability from each site. The superior accuracy of Stratum’s model yields an extra 10% in revenue for every mine (5/26/2023 Deck, Slide 3, Stratum AI).

Stratum's model generates highly accurate 3D models of mines by leveraging machine learning. Using data miners provide from each drillhole (a narrow hole drilled into the ground that reveals the amount of mineral at that spot), Stratum's AI model learns geological patterns and outputs a 3D model of precise ore locations. This data then enables miners to increase average mined grade by 8–12%, reduce waste sent to the mill by 20–40%, and correct errors faster with Stratum's real-time mine tracking feature.
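
Stratum's deep learning model is proprietary, but the shape of the problem can be sketched with a classical baseline: interpolate sparse drillhole assays onto a dense 3D block model. The inverse-distance-weighting sketch below is such a baseline, not Stratum's method, and the coordinates and grades are made up.

```python
import numpy as np

def idw_grade(sample_xyz: np.ndarray, sample_grade: np.ndarray,
              query_xyz: np.ndarray, power: float = 2.0) -> np.ndarray:
    """Inverse-distance-weighted interpolation of ore grade from drillhole samples
    onto arbitrary 3D points: sparse (x, y, z, grade) samples in, estimated grades
    at block centers out. A deep learning model replaces this step with learned
    geological patterns rather than a fixed distance rule."""
    d = np.linalg.norm(query_xyz[:, None, :] - sample_xyz[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-6) ** power
    return (w @ sample_grade) / w.sum(axis=1)

# Three hypothetical drillhole intercepts (x, y, z in meters) with assayed grades (% metal).
samples = np.array([[0.0, 0.0, -50.0], [40.0, 10.0, -60.0], [15.0, 30.0, -55.0]])
grades = np.array([1.2, 0.4, 0.9])
block_centers = np.array([[10.0, 10.0, -55.0], [35.0, 25.0, -58.0]])
print(idw_grade(samples, grades, block_centers))
```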

A GLIMPSE INTO THE FUTURE: OPENAI O1

Released in September 2024, OpenAI's o1 series models (initially o1-preview and o1-mini) were trained using some of the advanced machine learning methods outlined above, resulting in much higher accuracy than their predecessors and affirming the agentic future that is poised to arrive.

Using reinforcement learning to perform complex reasoning, o1 can produce a long internal chain of thought before responding to the user, making its answers both more accurate and more effective. Initial test results showed that o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US on a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%.

The full test results of OpenAI o1 across various benchmarks compared to its predecessor, GPT-4o
Four more charts visualize the improvements that o1 achieved over GPT-4o

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. For instance, it learns to recognize and correct its mistakes, to break down tricky steps into simpler ones, and to try a different approach when the current one isn't working. This process dramatically improves the model's ability to reason and explains its drastic gains. As future models are trained with advanced reinforcement learning methods like those described above, their ability to think accurately and clearly across longer context windows will improve, bringing us closer to a future where advanced reasoning agents are a reality.

CONCLUSION

To be fair, these recommendations are all investments in LLMs, and many skeptics argue that the transformer architecture powering these models is not advanced enough to make AI agents truly "think" in the way humans do (or to reach AGI). In a recent prominent article, TechCrunch highlighted that LLMs struggle with spelling-level tasks: asked how many "r"s are in the word "strawberry," models often answer incorrectly. It is true that current generative AI struggles with tasks like this, and with rendering fine details like fingers, due to limitations of the underlying transformer architecture. These models break text into tokens but lack a true understanding of the individual letters or fine image details within those tokens. The issue stems from how these models process information, converting text into numerical representations rather than comprehending it as humans do.

However, the same question can be posed to the skeptics: if current LLMs are truly so rudimentary and lacking in thought, how have they already generated significant value for businesses and consumers alike? There are undeniable statistics behind AI's impact on society: ChatGPT's roughly 200 million weekly active users amount to nearly 60% of the U.S. population (341 million), and 92% of Fortune 500 companies use OpenAI's products. In education, one study found that AI tools improved students' grades by 30% while reducing their anxiety by 20%. In medicine, around 38% of medical providers use AI systems to assist with diagnosis. The National Bureau of Economic Research also suggests that AI could save 5%-10% of total U.S. healthcare spending, roughly $200 billion to $360 billion annually, which could be used to treat more patients at lower cost.

However, the rapid and widespread adoption of OpenAI’s ChatGPT, with 1 million users signing up in just one week, might be dismissed as part of a hype cycle. While it’s possible that some AI services may not endure, the undeniable impact of AI, supported by solid statistics, shows its value to society. Only time will reveal which AI services have lasting significance and which will fade as mere trends.

This figure shows Gartner's Hype Cycle, in which new technologies often pass through several peaks and troughs before their actual value is established.

But if AGI does arrive in the near future, companies and societies will look very different than they do today. First, they will be shaped by AI agents, with entire organizations functioning like neural networks. This accessible productivity will accelerate company formation, although success is not guaranteed. In an early example of a startup building remote AI employees for enterprises, Artisan has automated 80% of a sales Business Development Representative's (BDR) job with its first AI agent, Ava. Specifically, Ava finds and researches leads using dozens of data sources, sends them hyper-personalized messages on LinkedIn and email, manages deliverability, and more. Judging from its traction, companies are more than happy to "hire" Ava, who costs 96% less than a human BDR: in three months, Artisan gained 120 enterprise customers and surpassed $1 million in ARR. And the company is not stopping with Ava; Liam, the Marketing Artisan, and James, the Customer Success Artisan, are in its product pipeline. Artisan's growth could signal a future where companies consist mostly of AI employees and a single AI engineer orchestrates the company. New ownership and management structures will therefore emerge, and business leaders will need to consider critical questions: What products will we create? How will the workforce evolve? And how will the total addressable market (TAM) be divided between humans and AI agents?

Building AI products is resource-intensive when you consider how much capital, compute, and training are needed to perfect models and build effective products. This could mean that only one or two technology companies of the future dominate a given market, offering similar products and competing fiercely. This near-monopolistic structure is already visible in today's generative AI industry, where OpenAI is far and away the market leader. In October 2024, it raised the largest VC round of all time, $6.6 billion at a $157 billion valuation; ChatGPT has more than 250 million users, OpenAI's annualized revenue has reportedly eclipsed $3.4 billion, and the company optimistically projects revenue of $100 billion in 2029, matching the current annual sales of Nestlé. Similar monopolistic or oligopolistic structures may follow in other AI-powered industries, such as generative search.

Given that massive companies across AI industries will rise and put wealth in the hands of a few, societies may experience significant shifts in wealth distribution and regulatory frameworks. Wealth disparities could widen as AI-driven companies streamline operations, potentially concentrating profits among a few AI engineers and owners. New regulations may emerge to address the ethical implications of AI autonomy, while income structures might evolve to include universal basic income or equity dividends.

To conclude, the rapid advancement of AI technologies offers immense potential to revolutionize industries and improve lives. However, we must approach this progress with deep caution, ensuring that ethical considerations, regulatory frameworks, and societal impacts are carefully managed to build a strong and equitable future.

Written by Colin Yuan

Studying philosophy at the University of Chicago. Writing because I'm curious.
