Ashish Bhandari

Posted on Jun 12

From PyTorch to Core ML: My Journey Running Recommendation Models Directly on iPhone

#deeplearning #ios #machinelearning #mobile

Most recommendation systems run entirely on the backend. User interactions are sent to servers, models generate recommendations, and results are returned to the client.

Recently, I explored a different approach: running recommendation inference directly on an iPhone using a Two-Tower model and Core ML.

The goal wasn't to replace backend recommendation systems, but to understand the feasibility, trade-offs, and developer experience of on-device personalization.

Here's what I learned.

Why On-Device Inference?

A traditional recommendation flow looks like this:

User Action
    ↓
 Backend
    ↓
 ML Model
    ↓
 Recommendations
    ↓
 Mobile App

This approach works well, but it comes with a few challenges:

Network latency
Backend inference costs
Privacy concerns
Offline limitations

With on-device inference, the flow becomes:

User Features
    ↓
 Core ML Model
    ↓
 Recommendation Embedding
    ↓
 Local Ranking

The recommendation logic executes directly on the device.

Benefits include:

Near-instant inference
Reduced backend dependency
Better privacy
Offline recommendation capabilities

Understanding the Two-Tower Architecture

Before implementation, I spent time understanding why Two-Tower architectures are widely used in recommendation systems.

The model consists of:

User Tower

Generates a user embedding from features such as:

User ID
Purchase history
Preferences
Behavioral signals

User Features
      ↓
  User Tower
      ↓
 User Embedding

Item Tower

Generates embeddings for products.

Product Features
       ↓
   Item Tower
       ↓
 Product Embedding

Matching

Recommendations are generated by comparing the user embedding with each product embedding, typically using a dot product or cosine similarity.

Similarity(
    User Embedding,
    Product Embedding
)

Products with higher similarity scores are ranked higher.

This architecture is particularly attractive because item embeddings can be precomputed and stored, leaving only user embedding generation to happen on-device.
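The matching step can be sketched in a few lines of plain Python. This is a minimal illustration, assuming dot-product similarity and a tiny in-memory table of precomputed item embeddings; the names `ITEM_EMBEDDINGS` and `rank_items` and all the vectors are made up for the example and are not taken from the actual model.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical precomputed item embeddings. In a real app these would be
# exported from the item tower and shipped to (or cached on) the device.
ITEM_EMBEDDINGS: Dict[str, List[float]] = {
    "shoes": [0.9, 0.1, 0.0],
    "socks": [0.7, 0.3, 0.1],
    "laptop": [0.0, 0.2, 0.9],
}


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_items(
    user_embedding: Sequence[float],
    item_embeddings: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every item against the user embedding and keep the best top_k."""
    scored = [
        (item_id, dot(user_embedding, embedding))
        for item_id, embedding in item_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in for the output of the user tower.
user_embedding = [1.0, 0.0, 0.0]
top = rank_items(user_embedding, ITEM_EMBEDDINGS)
```

A brute-force scan like this is fine for a small catalog. A larger catalog would likely need an approximate nearest-neighbor index instead, which is outside the scope of this sketch.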

Converting the Model to Core ML

One of the first challenges was model conversion.

I initially assumed that converting a PyTorch model into Core ML would be straightforward.

Reality was slightly different.

A common issue, and the one I ran into, is that a downloaded .pt file contains only the model weights (a state_dict) rather than the model architecture itself.

For example:

import torch
type(torch.load("model.pt", map_location="cpu"))

returned:

collections.OrderedDict

which meant the model architecture had to be reconstructed before conversion.
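A small helper can make that check explicit before attempting a conversion. This is a rough sketch: `describe_checkpoint` is a hypothetical name, the wrapper keys it looks for are just common conventions, and the `forward` attribute test is a heuristic rather than an official PyTorch check. It deliberately avoids importing torch, so you would pass it whatever `torch.load` returned.

```python
from collections import OrderedDict
from collections.abc import Mapping


def describe_checkpoint(obj) -> str:
    """Classify the object returned by torch.load() (heuristic, not exhaustive)."""
    if isinstance(obj, Mapping):
        # Some training scripts wrap the weights together with other metadata,
        # e.g. {"state_dict": ..., "epoch": ...}. These key names are assumptions.
        for key in ("state_dict", "model_state_dict"):
            if key in obj and isinstance(obj[key], Mapping):
                return f"wrapped state_dict under '{key}'"
        # A plain mapping of parameter names to tensors: weights only.
        return "state_dict"
    if hasattr(obj, "forward"):
        # Looks like a full module, so the architecture came with the file.
        return "module"
    return "unknown"


# Stand-in for a weights-only checkpoint (values would normally be tensors).
weights_only = OrderedDict([("embedding.weight", [0.1, 0.2])])
kind = describe_checkpoint(weights_only)
```

With the real file, the call would look like `describe_checkpoint(torch.load("model.pt", map_location="cpu"))`. Anything other than "module" means the architecture has to be rebuilt in code first.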

After rebuilding the model and exporting it through Core ML Tools, I obtained:

TwoTower.mlpackage

which could be integrated directly into an iOS project.
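For reference, the rebuild-and-export step might look roughly like the sketch below. Everything specific in it is an assumption: the `UserTower` layers are a placeholder that would have to match the architecture the weights were trained with, and the input name `userId` and the function name `convert_user_tower` are chosen for illustration. It requires PyTorch and coremltools to be installed, so the imports live inside the function.

```python
def convert_user_tower(
    weights_path: str,
    num_users: int,
    dim: int,
    output_path: str = "TwoTower.mlpackage",
) -> None:
    """Rebuild a (hypothetical) user tower, load its weights, export to Core ML."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import numpy as np
    import torch
    import coremltools as ct

    class UserTower(torch.nn.Module):
        # Placeholder architecture: it must mirror the one used in training,
        # otherwise load_state_dict() will report missing or unexpected keys.
        def __init__(self) -> None:
            super().__init__()
            self.embedding = torch.nn.Embedding(num_users, dim)
            self.project = torch.nn.Linear(dim, dim)

        def forward(self, user_id):
            return self.project(self.embedding(user_id))

    model = UserTower()
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()  # disable training-only behavior before tracing

    # Trace with an example input so Core ML Tools can see the graph.
    example = torch.zeros(1, dtype=torch.long)
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="userId", shape=example.shape, dtype=np.int32)],
        convert_to="mlprogram",  # ML program format, saved as .mlpackage
    )
    mlmodel.save(output_path)
```

The input name passed to `ct.TensorType` is what later shows up as the argument label on the generated Swift `prediction` method, so it is worth choosing deliberately.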

Integrating with Xcode

Importing the model into Xcode automatically generated Swift bindings.

One lesson I learned was not to assume generated class names.

I initially tried:

let model = UserTower()

which failed because Xcode names the generated class after the model package file (TwoTower.mlpackage), not after the towers inside the model.

The generated class was actually:

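// TwoTower() also compiles, but the no-argument initializer is deprecated.
let model = try TwoTower(configuration: MLModelConfiguration())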

A small detail, but one that cost more debugging time than I'd like to admit.

Running Inference

Once integrated, inference became surprisingly simple.

let prediction = try model.prediction(
    userId: userId
)

The Core ML runtime decides where to execute the model, choosing among the CPU, GPU, and Neural Engine depending on the device and the model.

The developer experience was significantly smoother than expected.

Performance Observations

The most impressive part was latency.

Since inference happened locally:

No network request
No API dependency
No server round-trip

The user experience felt immediate.

For recommendation systems, even a few hundred milliseconds can impact engagement.

On-device inference effectively removes an entire network hop from the critical path.

Challenges I Encountered

Model Conversion

This was by far the biggest challenge.

Questions that frequently came up:

Is the model TorchScript?
Is it only a state dictionary?
What are the expected input shapes?
Which outputs are generated?

The conversion pipeline often requires more ML engineering knowledge than mobile engineering knowledge.

Input Feature Engineering

A recommendation model is only as useful as the features provided.

Preparing inputs consistently between training and inference environments is critical.

Even small mismatches can significantly affect recommendation quality.
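One way to reduce that risk is to generate the feature encoding once and ship the same artifact to both environments. The sketch below is an assumption about how this could be done, not a description of my actual pipeline: it builds an ID-to-index vocabulary with a reserved slot for unseen IDs and serializes it as JSON, which the app could bundle alongside the model. All names are hypothetical.

```python
import json
from typing import Dict, Iterable

UNKNOWN_INDEX = 0  # reserved for IDs that were never seen during training


def build_vocab(user_ids: Iterable[str]) -> Dict[str, int]:
    """Assign a stable index to each distinct ID, starting at 1."""
    vocab: Dict[str, int] = {}
    for user_id in sorted(set(user_ids)):  # sorting keeps the mapping deterministic
        vocab[user_id] = len(vocab) + 1
    return vocab


def encode(user_id: str, vocab: Dict[str, int]) -> int:
    """Map an ID to its index, falling back to the unknown slot."""
    return vocab.get(user_id, UNKNOWN_INDEX)


# Training side: build the vocabulary and export it.
vocab = build_vocab(["u42", "u7", "u42", "u13"])
exported = json.dumps(vocab, sort_keys=True)

# Inference side: load the same file instead of re-deriving the mapping.
restored = json.loads(exported)
known = encode("u42", restored)
unseen = encode("brand-new-user", restored)
```

The important part is that the app never recomputes the mapping on its own. It only reads the exported file, so an ID resolves to the same index in training and on the device.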

Debugging Core ML Models

Debugging application code is easy.

Debugging machine learning outputs is harder.

When recommendations look wrong, the issue could be:

Data preprocessing
Feature encoding
Model conversion
Training quality

Finding the root cause requires a systematic approach.
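A useful first step is to rule out the conversion itself by comparing the embedding from the original PyTorch model with the one from the converted model for the same input. Below is a minimal sketch of such a check in plain Python. The function names, the sample vectors, and the tolerance are illustrative assumptions, and small differences are expected when the converted model runs at reduced precision.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two embeddings."""
    if len(a) != len(b) or not a:
        raise ValueError("embeddings must be non-empty and the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / norm


def conversion_matches(
    reference: Sequence[float],
    converted: Sequence[float],
    tolerance: float = 1e-3,  # assumed threshold; tune it for your model
) -> bool:
    """True if the converted output stays within tolerance of the reference."""
    return max_abs_diff(reference, converted) <= tolerance


# Made-up outputs for the same input from the two runtimes.
reference = [0.12, -0.40, 0.88]   # e.g. from the PyTorch model
converted = [0.1201, -0.3999, 0.8802]  # e.g. from the Core ML model
ok = conversion_matches(reference, converted)
```

If this check passes on a handful of inputs, the problem is more likely in preprocessing, feature encoding, or training than in the conversion.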

What Surprised Me Most

Before this project, I assumed recommendation systems were too heavy for mobile devices.

The reality is that modern phones are extremely capable.

For many personalization use cases, running lightweight recommendation models directly on-device is entirely practical.

The challenge isn't necessarily inference performance.

The challenge is building the surrounding ecosystem:

Feature pipelines
Embedding storage
Ranking strategies
Model updates
Experimentation frameworks

Key Takeaways

Two-Tower architectures are well-suited for on-device recommendation systems.
Core ML integration is easier than expected once the model is correctly converted.
Model conversion is often the hardest part of the workflow.
On-device inference removes the network round-trip from the latency path.
Privacy and offline capabilities become significant advantages.
The surrounding recommendation infrastructure is usually more complex than the inference itself.

Final Thoughts

This project started as an exploration into mobile machine learning and quickly became a deeper lesson in recommendation systems.

As mobile hardware continues to improve, I expect more personalization workloads to move closer to the user.

Running recommendation inference directly on-device won't replace backend recommendation systems entirely, but it opens interesting possibilities for low-latency, privacy-preserving, and offline-first user experiences.

For mobile engineers interested in machine learning, Two-Tower models and Core ML are an excellent place to start.

DEV Community