Ashish Bhandari

From PyTorch to Core ML: My Journey Running Recommendation Models Directly on iPhone

Most recommendation systems run entirely on the backend. User interactions are sent to servers, models generate recommendations, and results are returned to the client.

Recently, I explored a different approach: running recommendation inference directly on an iPhone using a Two-Tower model and Core ML.

The goal wasn't to replace backend recommendation systems, but to understand the feasibility, trade-offs, and developer experience of on-device personalization.

Here's what I learned.

Why On-Device Inference?

A traditional recommendation flow looks like this:

User Action
    ↓
 Backend
    ↓
 ML Model
    ↓
 Recommendations
    ↓
 Mobile App

This approach works well, but comes with a few challenges:

  • Network latency
  • Backend inference costs
  • Privacy concerns
  • Offline limitations

With on-device inference:

User Features
    ↓
 Core ML Model
    ↓
 Recommendation Embedding
    ↓
 Local Ranking

In this flow, the recommendation logic executes directly on the device.

Benefits include:

  • Near-instant inference
  • Reduced backend dependency
  • Better privacy
  • Offline recommendation capabilities

Understanding the Two-Tower Architecture

Before implementation, I spent time understanding why Two-Tower architectures are widely used in recommendation systems.

The model consists of:

User Tower

Generates a user embedding from features such as:

  • User ID
  • Purchase history
  • Preferences
  • Behavioral signals

User Features
      ↓
  User Tower
      ↓
 User Embedding

Item Tower

Generates embeddings for products.

Product Features
       ↓
   Item Tower
       ↓
 Product Embedding

Matching

Recommendations are generated by comparing embeddings.

Similarity(
    User Embedding,
    Product Embedding
)

Products with higher similarity scores are ranked higher.

This architecture is particularly attractive because item embeddings can be precomputed and stored, leaving only user embedding generation to happen on-device.
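
As a rough illustration of the matching step, the sketch below ranks precomputed item embeddings against a single user embedding. The product IDs, the vectors, and the choice of cosine similarity (rather than a raw dot product) are assumptions made for the example; the actual model may score pairs differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no signal"
    return dot / (norm_a * norm_b)


def rank_items(user_embedding, item_embeddings, top_k=3):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [
        (item_id, cosine_similarity(user_embedding, emb))
        for item_id, emb in item_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed item embeddings (these would ship with the app).
ITEM_EMBEDDINGS = {
    "shoes": [0.9, 0.1, 0.0],
    "jacket": [0.2, 0.8, 0.1],
    "watch": [0.0, 0.1, 0.9],
}

# Hypothetical output of the user tower for one user.
user_embedding = [1.0, 0.2, 0.0]

for item_id, score in rank_items(user_embedding, ITEM_EMBEDDINGS):
    print(f"{item_id}: {score:.3f}")
```

Because only the user embedding changes per request, the item side can stay a static lookup table on the device.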

Converting the Model to Core ML

One of the first challenges was model conversion.

I initially assumed that converting a PyTorch model into Core ML would be straightforward.

Reality was slightly different.

One issue I ran into was that the downloaded .pt file contained only the model weights (a state_dict), not the model architecture itself.

For example:

import torch
type(torch.load("model.pt"))

returned:

collections.OrderedDict

which meant the model architecture had to be reconstructed before conversion.
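
One way to approach that reconstruction is to read the parameter names and tensor shapes out of the state dictionary, since they constrain what the architecture must look like. The sketch below works on a plain name-to-shape mapping so that it runs without PyTorch installed; with a real checkpoint, the mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. The layer names and sizes are hypothetical, and the type guess is only a heuristic.

```python
from collections import OrderedDict


def group_by_layer(shapes):
    """Group parameter shapes by layer prefix (the text before the last dot)."""
    layers = OrderedDict()
    for name, shape in shapes.items():
        prefix, _, param = name.rpartition(".")
        layers.setdefault(prefix, {})[param] = shape
    return layers


def describe_layer(params):
    """Guess a layer type from its parameter shapes (heuristic, not exhaustive)."""
    weight = params.get("weight")
    if weight is None or len(weight) != 2:
        return "unknown"
    if "bias" in params:
        # nn.Linear stores its weight as (out_features, in_features).
        return f"Linear(in={weight[1]}, out={weight[0]})"
    # nn.Embedding stores its weight as (num_embeddings, embedding_dim).
    # Caveat: a Linear layer created with bias=False would also land here.
    return f"Embedding(num={weight[0]}, dim={weight[1]})"


# Hypothetical shapes, as they might appear in a two-tower state_dict.
SHAPES = OrderedDict([
    ("user_tower.embedding.weight", (10000, 64)),
    ("user_tower.fc.weight", (32, 64)),
    ("user_tower.fc.bias", (32,)),
    ("item_tower.embedding.weight", (5000, 64)),
])

for layer, params in group_by_layer(SHAPES).items():
    print(f"{layer}: {describe_layer(params)}")
```

A summary like this is a starting point only; activation functions, dropout, and normalization choices leave little or no trace in the weights and still have to come from the training code.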

After rebuilding the model and exporting it through Core ML Tools, I obtained:

TwoTower.mlpackage

which could be integrated directly into an iOS project.

Integrating with Xcode

Importing the model into Xcode automatically generated Swift bindings.

One lesson I learned was not to assume generated class names.

I initially tried:

let model = UserTower()

which failed because Xcode generated classes based on the model package name rather than internal tower names.

The generated class was actually:

let model = TwoTower()

A small detail, but one that cost more debugging time than I'd like to admit.

Running Inference

Once integrated, inference became surprisingly simple.

let prediction = try model.prediction(
    userId: userId
)

The Core ML runtime decided where to run the model, using whichever compute units were available (CPU, GPU, or Neural Engine).

The developer experience was significantly smoother than expected.

Performance Observations

The most impressive part was latency.

Since inference happened locally:

  • No network request
  • No API dependency
  • No server round-trip

The user experience felt immediate.

For recommendation systems, even a few hundred milliseconds can impact engagement.

On-device inference effectively removes an entire network hop from the critical path.

Challenges I Encountered

Model Conversion

This was by far the biggest challenge.

Questions that frequently came up:

  • Is the model TorchScript?
  • Is it only a state dictionary?
  • What are the expected input shapes?
  • Which outputs are generated?

The conversion pipeline often requires more ML engineering knowledge than mobile engineering knowledge.
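
The first two questions can often be answered by inspecting whatever `torch.load` returns. The sketch below does this with duck typing so that it runs without PyTorch installed; the stand-in classes are hypothetical, and the checks (a mapping means weights, a `graph` attribute suggests TorchScript, a `state_dict` method suggests a full module) are heuristics rather than guarantees.

```python
from collections import OrderedDict
from collections.abc import Mapping


def classify_checkpoint(obj):
    """Heuristically label a loaded checkpoint object."""
    if isinstance(obj, Mapping):
        # Training checkpoints often nest the weights under a key such as
        # "model_state_dict" or "state_dict" (a convention, not a guarantee).
        nested = [k for k in ("model_state_dict", "state_dict") if k in obj]
        if nested:
            return f"checkpoint dict (weights under '{nested[0]}')"
        return "state_dict (weights only; architecture must be rebuilt)"
    if hasattr(obj, "graph"):
        # TorchScript modules expose the graph they were traced or scripted to.
        return "TorchScript module (architecture included)"
    if callable(getattr(obj, "state_dict", None)):
        return "full module (architecture included)"
    return "unrecognized object"


# Hypothetical stand-ins for what a real load might return.
class FakeModule:
    def state_dict(self):
        return OrderedDict()


class FakeScriptModule(FakeModule):
    graph = "graph(...)"


print(classify_checkpoint(OrderedDict({"fc.weight": [[0.0]]})))
print(classify_checkpoint({"model_state_dict": {}, "epoch": 3}))
print(classify_checkpoint(FakeScriptModule()))
print(classify_checkpoint(FakeModule()))
```

Input shapes and output names still have to come from the training code or the model's documentation; nothing in a weights-only file records them directly.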

Input Feature Engineering

A recommendation model is only as useful as the features provided.

Preparing inputs consistently between training and inference environments is critical.

Even small mismatches can significantly affect recommendation quality.
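
A minimal sketch of one way to keep encodings consistent: build the ID-to-index vocabulary once at training time, serialize it, and have the app load that same file instead of re-deriving indices. The user IDs, the JSON format, and the convention of reserving index 0 for unseen IDs are all assumptions for illustration; the model's embedding table would need to be sized with that extra slot in mind.

```python
import json

UNKNOWN_INDEX = 0  # assumed to be reserved for IDs never seen during training


def build_vocab(raw_ids):
    """Map each distinct raw ID to a stable integer index, starting at 1."""
    return {raw_id: i for i, raw_id in enumerate(sorted(set(raw_ids)), start=1)}


def encode(raw_id, vocab):
    """Return the index used at training time, or UNKNOWN_INDEX if unseen."""
    return vocab.get(raw_id, UNKNOWN_INDEX)


# Training side: build the vocabulary once and serialize it.
training_user_ids = ["u42", "u7", "u42", "u19"]
vocab = build_vocab(training_user_ids)
vocab_json = json.dumps(vocab, sort_keys=True)

# Inference side: load the same serialized vocabulary and encode with it.
loaded_vocab = json.loads(vocab_json)
print(encode("u7", loaded_vocab))    # a known ID keeps its training index
print(encode("u999", loaded_vocab))  # an unseen ID falls back to UNKNOWN_INDEX
```

Sorting before enumerating keeps the mapping deterministic, so rebuilding the vocabulary from the same data yields the same indices.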

Debugging Core ML Models

Debugging application code is comparatively straightforward.

Debugging machine learning outputs is harder.

When recommendations look wrong, the issue could be:

  • Data preprocessing
  • Feature encoding
  • Model conversion
  • Training quality

Finding the root cause requires a systematic approach.
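
One systematic check is to rule out conversion first: feed identical inputs to the original model and the converted one, then compare the embeddings numerically. The sketch below assumes both sets of outputs have already been exported as plain lists of floats; the case names, the values, and the 1e-3 tolerance are hypothetical, and a looser tolerance may be appropriate if the converted model runs in reduced precision.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two embeddings."""
    if len(reference) != len(candidate):
        raise ValueError(
            f"embedding lengths differ: {len(reference)} vs {len(candidate)}"
        )
    if not reference:
        return 0.0
    return max(abs(r - c) for r, c in zip(reference, candidate))


def check_parity(cases, tolerance=1e-3):
    """Return the names of cases whose outputs drift beyond the tolerance."""
    failures = []
    for name, (reference, candidate) in cases.items():
        if max_abs_diff(reference, candidate) > tolerance:
            failures.append(name)
    return failures


# Hypothetical (reference output, converted-model output) pairs.
CASES = {
    "user_1": ([0.12, -0.50, 0.33], [0.1201, -0.4999, 0.3302]),
    "user_2": ([0.80, 0.10, -0.20], [0.80, 0.25, -0.20]),
}

failing = check_parity(CASES)
print("cases outside tolerance:", failing)
```

If every case passes, conversion is an unlikely culprit and attention can shift to preprocessing, feature encoding, or training quality.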

What Surprised Me Most

Before this project, I assumed recommendation systems were too heavy for mobile devices.

The reality is that modern phones are extremely capable.

For many personalization use cases, running lightweight recommendation models directly on-device is entirely practical.

The challenge isn't necessarily inference performance.

The challenge is building the surrounding ecosystem:

  • Feature pipelines
  • Embedding storage
  • Ranking strategies
  • Model updates
  • Experimentation frameworks

Key Takeaways

  1. Two-Tower architectures are well-suited for on-device recommendation systems.
  2. Core ML integration is easier than expected once the model is correctly converted.
  3. Model conversion is often the hardest part of the workflow.
  4. On-device inference removes the network round-trip, which noticeably reduces perceived latency.
  5. Privacy and offline capabilities become significant advantages.
  6. The surrounding recommendation infrastructure is usually more complex than the inference itself.

Final Thoughts

This project started as an exploration into mobile machine learning and quickly became a deeper lesson in recommendation systems.

As mobile hardware continues to improve, I expect more personalization workloads to move closer to the user.

Running recommendation inference directly on-device won't replace backend recommendation systems entirely, but it opens interesting possibilities for low-latency, privacy-preserving, and offline-first user experiences.

For mobile engineers interested in machine learning, Two-Tower models and Core ML are an excellent place to start.
