Hey everyone,
It’s been a while since I published anything for this newsletter. Things have been tough lately in startup land, and I wasn’t comfortable spending time writing this newsletter instead of working on my startup.
Now that I have more clarity in my head, I will write again, albeit with a lower intensity than before.
If you are a new subscriber, expect an occasional email from me. I mostly write about engineering at startups and comment on content that sparks my curiosity.
How AI can use software
A while back, I wrote about AI agents as the next evolution of AI tools. Fast-forward a few months, and things have changed a lot. Newer, more robust models have been released (GPT-4, Llama 3, Gemini), and viral AI agent startups have emerged (Devin). Lastly, vision-focused models such as GPT-4V and LLaVA have made inspecting images (including computer interfaces) much easier.
Since my startup revolves around the future of work, I am heavily investing my time learning more about how AI can use software. I believe this will be one of the biggest opportunities in the tech world.
There are at least three ways AI can use software:
Through API
This is the most common way for AI to interact with other software; even most of the AIs in the ChatGPT plugin marketplace work this way. It’s also the most reliable, since the communication contract between the two pieces of software is already established.
Many AI agents on the internet seemingly use software like humans would (by pointing and clicking through the UI), but behind the scenes, most actually call APIs.
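A minimal sketch of this pattern, often called tool or function calling: the model emits a structured request, and the agent executes the matching API call. The `create_ticket` tool and the hardcoded model output below are hypothetical stand-ins for illustration.

```python
import json

# Hypothetical tool registry; in a real agent, each entry would wrap an
# actual API call (HTTP request, SDK method, etc.).
TOOLS = {
    "create_ticket": lambda title: f"ticket created: {title}",
}

# In a real agent, this JSON would come from the LLM's function-calling
# output; here it is hardcoded for illustration.
model_output = '{"tool": "create_ticket", "args": {"title": "Fix login bug"}}'

call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["args"])
print(result)  # ticket created: Fix login bug
```

The key point: the LLM never touches a UI. It produces a structured intent, and plain old code does the reliable part.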
Through computer vision
Humans rely on visuals to use software. That’s why the field of User Interface Design exists: designing interfaces so that humans can use software effectively.
With multimodal models, especially those equipped with vision understanding, AI can mimic how humans see and use software. The models analyze the software's rendered visuals and identify where the texts, buttons, and forms are. The LLM component then decides what to do next.
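Here is a toy sketch of that loop, with the vision model and the LLM planner stubbed out (a real system would call GPT-4V, LLaVA, or similar; the element coordinates and labels below are invented):

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str   # "button", "input", "link", ...
    label: str
    x: int      # screen coordinates of the element's center
    y: int

def detect_elements(screenshot: bytes) -> list[UIElement]:
    # Stub standing in for a vision model that locates and labels
    # on-screen elements from a rendered screenshot.
    return [UIElement("input", "Email", 320, 400),
            UIElement("button", "Submit", 320, 480)]

def decide_action(goal: str, elements: list[UIElement]):
    # Stub standing in for the LLM planner: click the element whose
    # label appears in the user's goal.
    for el in elements:
        if el.label.lower() in goal.lower():
            return ("click", el.x, el.y)
    return ("noop", 0, 0)

action = decide_action("press the Submit button", detect_elements(b""))
print(action)  # ('click', 320, 480)
```

The interesting engineering is entirely inside the two stubs; the surrounding loop (screenshot in, action out) stays this simple.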
Through LLM
If the software is a web app, an LLM can analyze the markup code directly. It can detect text, buttons, input boxes, and other UI elements.
Unlike traditional RPA (Robotic Process Automation) software, where the bot only works if the user specifies the exact UI elements to act on (selectors that may break when the website is updated), AI agents can use an LLM to select the correct UI elements intelligently.
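The first step in this approach is distilling raw HTML into a compact list of interactive elements the LLM can reason over. A minimal sketch using only the standard library (the sample markup is made up):

```python
from html.parser import HTMLParser

class UIExtractor(HTMLParser):
    # Tags a user can typically interact with.
    INTERACTIVE = {"a", "button", "input", "select", "textarea", "form"}

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag in self.INTERACTIVE:
            self.elements.append((tag, dict(attrs)))

parser = UIExtractor()
parser.feed('<form><input name="q"><button id="go">Search</button></form>')
print(parser.elements)
# [('form', {}), ('input', {'name': 'q'}), ('button', {'id': 'go'})]
```

An agent would then hand this element list (plus the user’s instruction) to the LLM, which picks the element to act on by meaning rather than by a brittle hardcoded selector.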
Here are two projects I found recently that might be worth taking a look at if you are into this topic:
1. Adept AI’s Fuyu-8B Models
Adept AI is one of the most funded startups pursuing the future of AIs working directly with other software. They have published many whitepapers and prototypes, but there is no working product yet.
Fuyu-8B is one of their latest multimodal models. It has a simpler architecture and training procedure than other multimodal models, presumably because Adept wanted narrower capabilities that were more suited to their goals. The model is designed to power digital agents to work with graphs, diagrams, and software UI elements, basically anything on screen.
I agree with their approach of building multimodal models in pursuit of their AI agent goal. For a real AI agent to be as helpful as a human, it needs to analyze the screen the average human user sees, understand the user’s context, and then act on the user’s behalf. All of those activities rely on image recognition and understanding. Text-based LLMs only work if the input data is represented in textual format, and a lot of information is lost in that conversion.
2. WebLINX
WebLINX is a pretty cool model tuned for real-world website navigation. It provides a digital agent that controls a web browser and follows user instructions to solve real-world tasks inside and across websites.
The WebLINX agent consists of two components: 1) a Dense Markup Ranker (DMR) and 2) an action model.
The Dense Markup Ranker is a specialized model that converts an HTML page into a compact representation containing the most relevant elements while discarding the rest.
The action model (which can be an LLM or a multimodal model) then processes the inputs (HTML, instructions, history, and images) and generates the next actions to take (clicking buttons, typing into forms, or loading a new page).
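To make the ranking idea concrete, here is a toy version of that step, using simple word overlap in place of the dense embeddings the real DMR uses (the candidate element texts below are invented):

```python
def rank_elements(instruction: str, elements: list[str], k: int = 2) -> list[str]:
    # Toy ranker: score each candidate element by word overlap with the
    # instruction, keep the top k. WebLINX's DMR uses dense embeddings,
    # not word overlap; this only illustrates the shape of the step.
    words = set(instruction.lower().split())

    def score(el: str) -> int:
        return len(words & set(el.lower().split()))

    return sorted(elements, key=score, reverse=True)[:k]

candidates = ["Sign in", "Search flights", "Privacy policy", "Book a flight"]
top = rank_elements("book a flight to Paris", candidates)
print(top[0])  # Book a flight
```

The payoff is context-budget management: the action model sees only the handful of elements that plausibly matter, instead of the entire page.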
Tidy First, by Kent Beck
Software creates value in two ways:
• What it does today
• The possibility of new things we can make it do tomorrow.
A while back, I read Tidy First by Kent Beck. In software engineering, Kent is known for co-authoring the Refactoring book and creating the Extreme Programming method. In a sense, Tidy First is Refactoring’s little brother. The book is only 100 pages long and contains dozens of small, actionable tips to improve your code.
The book also touches upon how to fit tidying into your own personal development workflow:
• When do you start tidying?
• When do you stop tidying?
• How do you combine tidying, changing the structure of the code, with changing the behavior of the system?
My favorite part of the book is chapter 21: First, After, Later, Never. With regard to the actual task you need to do with the code, when do you tidy? First, after the task, later, or never?
In short, Kent argues to:
Tidy never when:
• You’re never changing this code again.
• There’s nothing to learn by improving the design.
Tidy later when:
• You have a big batch of tidying to do without immediate payoff.
• There’s an eventual payoff for completing the tidying.
• You can tidy in little batches.
Tidy after when:
• Waiting until next time to tidy first will be more expensive.
• You won’t feel a sense of completion if you don’t tidy after.
Tidy first when:
• It will pay off immediately, either in improved comprehension or in cheaper behavior changes.
• You know what to tidy and how.
Overall, this book is great and should be on every engineer’s and engineering leader’s shelf.
Top Finds
Have you ever dreamed that you might be far more successful than you are today? Our society tells us, over and over, that if we’re going to achieve anything, we’d better do it while we’re young. But whether you’re at the start of your career, sensing you’re on the wrong path, or feeling unsettled later in life, you’re likely wondering how to reinvent yourself, and whether you’ve left it too late.
Whenever we think we are finished, or notice how much older we are than we used to be, remember that late bloomers exist: people who joined a field very late yet still thrive, even outcompeting younger people.
Also, a related tweet:
“Your worst day is a chance to show your best qualities, to stand out, and to learn an enormous amount about yourself. Very few people plan or prepare for what they’ll do and how they’ll act during those times. Those who do might well end up turning their worst day into their best.”
It’s easy to sail a ship on a peaceful sea. For some people, it’s not their best days that determine their character, but what they do on their worst days.
So, keep going.
Make progress, however small.
Remember that everything compounds.
Patience.
See you next time!