The first few days felt smooth. Then reality hit. Here's what broke, what's still broken, and what I'm doing about it.

#ai #android #automation #softwareengineering

The first four days of this build log were about momentum. Create the repo. Wire up Gemma 4. Get ADB working. Add OCR. Ship a working task.

Day 5 is different. This one is about the walls I've hit and the walls I'm still staring at.

If you're building something real—especially from a phone—you know the feeling. The early wins dry up. The problems get harder. The internet doesn't have a tutorial for what you're trying to do. This post is about those moments.

What Broke in the Early Days (That I Didn't Write About)

I made the first few days look clean. They weren't.

On Day 2, after creating the repo and pushing agent.py, I realized ADB wasn't detecting my device. I spent two hours reinstalling packages in Termux, toggling USB debugging, and restarting my phone. The fix was embarrassing: I had accidentally revoked USB debugging authorization when clearing old permissions. One tap in Developer Options and everything worked.
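A quick check like the one below can surface this kind of failure in seconds instead of hours. It is a minimal sketch, not code from the repo: it assumes `adb` is on the PATH, and the function names (`parse_adb_devices`, `diagnose`, `check_adb`) are hypothetical. A phone whose debugging authorization was revoked usually appears in `adb devices` with the state `unauthorized`.

```python
import subprocess


def parse_adb_devices(output):
    """Map each device serial to its state from `adb devices` output."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines, the header, and "* daemon ..." startup messages.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def diagnose(devices):
    """Turn the parsed device table into a one-line hint."""
    if not devices:
        return "no device: check the connection and that USB debugging is on"
    states = set(devices.values())
    if "unauthorized" in states:
        return "unauthorized: accept the debugging prompt on the phone"
    if "device" in states:
        return "ok"
    return "device present but not ready: " + ", ".join(sorted(states))


def check_adb():
    """Run `adb devices` and return a hint (hypothetical helper)."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, timeout=10
    )
    return diagnose(parse_adb_devices(result.stdout))
```

Calling `check_adb()` at agent startup would fail fast with a readable hint rather than letting later commands time out.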

On Day 3, Tesseract OCR kept returning gibberish. The text extraction was so bad I thought the library was broken. Turns out I was passing the screenshot in the wrong format. Tesseract needs a clean, high-contrast image. My screenshots were compressed and noisy. Converting to PNG first fixed 80% of the issues.
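One way to rule out the format problem early is to grab the screenshot as PNG in the first place and verify the bytes before handing them to OCR. This is a sketch under assumptions: how the agent actually captures screenshots isn't shown here, so it assumes capture through `adb exec-out screencap -p`, and the helper names are made up for illustration.

```python
import subprocess

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data):
    """Return True if the bytes begin with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


def capture_screen_png():
    """Capture the screen as PNG bytes (assumes adb is on the PATH)."""
    # `screencap -p` asks the device to encode the frame as PNG;
    # `exec-out` passes the binary output through unmodified.
    result = subprocess.run(
        ["adb", "exec-out", "screencap", "-p"],
        capture_output=True,
        timeout=15,
        check=True,
    )
    if not is_png(result.stdout):
        raise ValueError("screencap did not return PNG data")
    return result.stdout
```

PNG is lossless, so this avoids adding compression noise, but it does nothing about contrast; that would still need a separate preprocessing step.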

On Day 4, the verification layer worked perfectly in testing—then failed every single time on a real WhatsApp screen. The problem? WhatsApp's UI updates asynchronously. The contact list wasn't fully loaded when the agent took its verification screenshot. A 2-second delay before verification solved it. That delay cost me three hours of debugging.
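A fixed delay works, but it is either too long on a fast screen or too short on a slow one. An alternative is to poll until two consecutive screenshots are identical. The sketch below is an assumption-heavy illustration, not the repo's code: `capture` is any hypothetical callable returning screenshot bytes, and the defaults are guesses.

```python
import hashlib
import time


def wait_until_stable(capture, settle_checks=2, interval=0.5, timeout=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll capture() until `settle_checks` consecutive frames match.

    Returns (frame, stable). If the timeout expires first, returns the
    last frame seen with stable=False so the caller can decide what to do.
    """
    deadline = clock() + timeout
    frame = capture()
    digest = hashlib.sha256(frame).hexdigest()
    same = 1  # number of consecutive identical frames seen so far
    while same < settle_checks and clock() < deadline:
        sleep(interval)
        frame = capture()
        new_digest = hashlib.sha256(frame).hexdigest()
        same = same + 1 if new_digest == digest else 1
        digest = new_digest
    return frame, same >= settle_checks
```

Screens with a blinking cursor or a running animation may never settle, which is why the timeout fallback matters. Comparing a cropped region instead of the full frame would be a natural refinement.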

These are the moments that don't make it into the polished posts. But they're the real work.

What I'm Struggling With Right Now

Here's where I'm stuck today.

1. OCR Is Slow and Unreliable

Tesseract takes 8-12 seconds per screen scan on my device. For a single action, that's fine. For a 5-step task with one scan per step, that's 40 to 60 seconds of just waiting for text recognition. The fuzzy matching helps with accuracy, but it can't fix the speed.
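For readers wondering what fuzzy matching looks like here: the sketch below uses Python's standard `difflib` to find a target label among noisy OCR lines. It is one possible approach, not necessarily the one in the repo; the function name `find_label` and the 0.75 cutoff are assumptions.

```python
import difflib


def find_label(target, ocr_lines, cutoff=0.75):
    """Return the OCR line that best matches `target`, or None.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    # Map normalized text back to the original OCR line.
    normalized = {
        line.strip().lower(): line for line in ocr_lines if line.strip()
    }
    matches = difflib.get_close_matches(
        target.strip().lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None
```

With this, an OCR misread such as "Nev chat" still resolves to the "New chat" target, while unrelated lines fall below the cutoff and return `None`. A lower cutoff tolerates more noise at the cost of more false matches.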

I'm looking at lighter alternatives—maybe a custom-trained model, maybe switching to Google ML Kit's on-device text recognition. But both options add complexity I wasn't planning for.

2. The Agent Can't Handle Interruptions

If a WhatsApp call comes in mid-task, the agent breaks. If a notification pops up and covers the target button, the agent taps the notification instead. If the screen times out and locks, everything stops.

A human knows to dismiss a call and continue. The agent doesn't. Building a recovery handler for every possible interruption feels like chasing infinity.
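Covering every interruption is impossible, but a small registry of known ones keeps the problem bounded. The sketch below is a hypothetical structure, not the repo's implementation: `Recovery`, `run_step`, and the detection logic are all illustrative, and `read_screen` stands in for whatever returns the OCR text of the current screen.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recovery:
    name: str
    detect: Callable[[str], bool]  # inspects the current screen text
    recover: Callable[[], None]    # clears the interruption


def run_step(step, read_screen, recoveries, max_attempts=3):
    """Run step() once the screen is free of known interruptions.

    Before each attempt, the screen text is checked against every
    registered Recovery. If one matches, its recover() runs and the
    screen is re-read. Raises RuntimeError if still blocked at the end.
    """
    for _ in range(max_attempts):
        screen_text = read_screen()
        blocker = next((r for r in recoveries if r.detect(screen_text)), None)
        if blocker is None:
            return step()
        blocker.recover()
    raise RuntimeError("step still blocked after %d attempts" % max_attempts)
```

New failure modes then become one more `Recovery` entry instead of another special case in the main loop. The cost is an extra screen read per attempt, which is expensive while each OCR scan takes several seconds.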

3. The Phone Gets Hot

After 10-15 minutes of continuous use, my phone heats up. The CPU throttles. Inference slows down. One time the phone restarted itself mid-task. I don't have a solution for thermal management on a device that was never designed to run AI agents.
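One stopgap worth trying is to pause between steps when the phone reports a high temperature. This sketch reads the battery temperature from `adb shell dumpsys battery`, which reports it in tenths of a degree Celsius. It is only a rough proxy for CPU temperature, and the 40 °C limit, the pause scaling, and the function names are all untested assumptions.

```python
import re
import subprocess
import time


def parse_battery_temp_c(dumpsys_output):
    """Extract the battery temperature in Celsius, or None if absent."""
    match = re.search(
        r"^\s*temperature:\s*(-?\d+)\s*$", dumpsys_output, re.MULTILINE
    )
    # dumpsys reports tenths of a degree, e.g. 423 means 42.3 C.
    return int(match.group(1)) / 10.0 if match else None


def cooldown_seconds(temp_c, limit_c=40.0, seconds_per_degree=5.0,
                     max_pause=60.0):
    """Pause length: zero below the limit, then scaled and capped."""
    if temp_c is None or temp_c <= limit_c:
        return 0.0
    return min(max_pause, (temp_c - limit_c) * seconds_per_degree)


def pause_if_hot():
    """Sleep between agent steps when the battery reads hot."""
    output = subprocess.run(
        ["adb", "shell", "dumpsys", "battery"],
        capture_output=True, text=True, timeout=10,
    ).stdout
    pause = cooldown_seconds(parse_battery_temp_c(output))
    if pause:
        time.sleep(pause)
    return pause
```

This trades speed for stability and does not address the underlying heat, so it is a way to avoid mid-task restarts rather than a real thermal solution.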

4. I'm Still Building Alone

I haven't found collaborators or contributors yet. The repo is public. The code is there. But it's a solo build, and some of these problems need a second brain.

What's Next

These challenges don't mean the project is failing. They mean it's real. Every prototype hits walls. The difference between a prototype and a product is whether you keep going.

Next steps:

Research lighter OCR alternatives
Write a basic interruption handler for the most common failures
Start testing on a second device to isolate hardware issues
Open an issue on GitHub for each known problem—turn struggles into documentation

The Repo

👉 github.com/Dexter2344/phone-agent

If you've hit similar walls—OCR, thermals, UI interruptions—I'd genuinely like to hear how you solved them. Drop a comment or open an issue.

This is Day 5. The honeymoon is over. The real build starts now.