Building a Recommendation Engine Using SageMaker and TensorFlow: Step-by-Step Guide


Introduction

Recommendation engines power personalized experiences across platforms like Netflix, YouTube, and Amazon. In this tutorial, you'll learn to build a production-ready system using Amazon SageMaker and TensorFlow, focusing on the MovieLens dataset. We'll cover everything from data preprocessing to deployment, with code examples suitable for beginners.

Prerequisites

AWS Account (Free Tier eligible)
Basic Python knowledge
Familiarity with Jupyter Notebooks

Step 1: SageMaker Environment Setup

Create a SageMaker notebook instance, then configure the execution role and session in a notebook cell:

# Configure IAM role and session
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

Use an ml.t3.medium notebook instance for cost efficiency. Ensure the IAM execution role has S3 access, since the training data and model artifacts are stored in the default bucket.

Step 2: Data Preparation with MovieLens

Download and preprocess the dataset:

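import pandas as pd
from sklearn.model_selection import train_test_split

# Load the MovieLens "latest small" dataset (about 100k ratings)
ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')

# Raw movie IDs are sparse, so re-encode IDs as contiguous integers
# starting at 1; the embedding layers in Step 3 index by these values
user_index = {u: i for i, u in enumerate(ratings['userId'].unique(), start=1)}
movie_index = {m: i for i, m in enumerate(movies['movieId'].unique(), start=1)}
ratings['userId'] = ratings['userId'].map(user_index)
ratings['movieId'] = ratings['movieId'].map(movie_index)
movies['movieId'] = movies['movieId'].map(movie_index)

# Collect the distinct user and movie IDs (used to size the embeddings)
user_ids = ratings['userId'].unique()
movie_ids = movies['movieId'].unique()

# Split data (80% training, 20% testing) with a fixed seed for repeatability
train, test = train_test_split(ratings, test_size=0.2, random_state=42)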

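To treat the ratings as implicit feedback, label each observed interaction as a positive and add sampled unrated movies as negatives (negative sampling). This matters in real-world scenarios where users rarely rate items explicitly, and it produces the 0/1 labels that the binary cross-entropy loss in Step 3 expects.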
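
The split above keeps only observed ratings, so the negatives still have to be generated. A minimal sketch of one common approach follows, assuming every observed rating counts as a positive (label 1) and unrated movies are drawn uniformly as negatives (label 0). The function name `sample_negatives` and the `num_negatives` parameter are illustrative and not part of the original tutorial.

```python
import random

def sample_negatives(interactions, all_items, num_negatives=4, seed=0):
    """Return (user, item, label) rows: each positive plus sampled negatives.

    interactions: iterable of (user, item) pairs the user actually rated.
    all_items: collection of every item ID in the catalog.
    """
    rng = random.Random(seed)
    interactions = list(interactions)
    items = sorted(all_items)

    # Remember what each user has already interacted with
    seen = {}
    for user, item in interactions:
        seen.setdefault(user, set()).add(item)

    rows = []
    for user, item in interactions:
        rows.append((user, item, 1))
        # Cap the request so a user who rated nearly everything cannot stall the loop
        wanted = min(num_negatives, len(items) - len(seen[user]))
        drawn = set()
        while len(drawn) < wanted:
            candidate = rng.choice(items)
            if candidate not in seen[user]:
                drawn.add(candidate)
        rows.extend((user, negative, 0) for negative in sorted(drawn))
    return rows

# Toy example: 3 positives, 2 negatives each
rows = sample_negatives([(1, 10), (1, 11), (2, 10)], all_items=range(10, 20), num_negatives=2)
print(len(rows))  # 9
```

With the real data, the `(user, item)` pairs would come from the `userId` and `movieId` columns of `train`, and the resulting rows would be written to `train.csv` with a label column.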

Step 3: Building the Neural Collaborative Filtering Model

Implement a neural collaborative filtering model with separate user and item embedding towers in TensorFlow:

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Flatten, Concatenate, Dense

# User embedding tower
user_input = tf.keras.Input(shape=(1,), name='user_id')
user_embedding = Embedding(len(user_ids)+1, 64)(user_input)
user_vec = Flatten()(user_embedding)

# Item embedding tower
item_input = tf.keras.Input(shape=(1,), name='movie_id')
item_embedding = Embedding(len(movie_ids)+1, 64)(item_input)
item_vec = Flatten()(item_embedding)

# Concatenate and add dense layers
concat = Concatenate()([user_vec, item_vec])
dense = Dense(128, activation='relu')(concat)
output = Dense(1, activation='sigmoid')(dense)

model = tf.keras.Model(inputs=[user_input, item_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')

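This architecture learns latent user and item embeddings from their interactions, as matrix factorization does, but replaces the dot product with dense layers that can model nonlinear relationships between the two embeddings.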
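
To make the data flow concrete, here is a rough sketch of what a single forward pass through such a model computes: two embedding lookups, a concatenation, a ReLU dense layer, and a sigmoid output. It is plain Python rather than TensorFlow, uses untrained random weights, omits biases, and shrinks the dimensions (4 instead of 64, 8 instead of 128) for readability. All names are illustrative.

```python
import math
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

EMBED_DIM, HIDDEN = 4, 8
user_table = random_matrix(6, EMBED_DIM)         # 5 users, index 0 unused
item_table = random_matrix(11, EMBED_DIM)        # 10 items, index 0 unused
w_hidden = random_matrix(2 * EMBED_DIM, HIDDEN)  # concat -> hidden layer
w_out = random_matrix(HIDDEN, 1)                 # hidden layer -> score

def dense(vector, weights):
    """Vector-matrix product: one output per weight column (bias omitted)."""
    return [sum(v * row[j] for v, row in zip(vector, weights))
            for j in range(len(weights[0]))]

def predict(user_id, item_id):
    concat = user_table[user_id] + item_table[item_id]       # embedding lookups + concatenate
    hidden = [max(0.0, h) for h in dense(concat, w_hidden)]  # dense layer with ReLU
    logit = dense(hidden, w_out)[0]
    return 1.0 / (1.0 + math.exp(-logit))                    # sigmoid -> probability

score = predict(user_id=3, item_id=7)
print(f"{score:.3f}")
```

Because the weights are untrained, the score only demonstrates the shape of the computation; training adjusts the embedding tables and dense weights so that the output separates positives from negatives.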

Step 4: Training on SageMaker

Configure the TensorFlow estimator:

from sagemaker.tensorflow import TensorFlow

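# Save the Step 2 splits to disk, then upload them to S3
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
train_path = sagemaker_session.upload_data('train.csv', bucket=bucket)
test_path = sagemaker_session.upload_data('test.csv', bucket=bucket)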

# Initialize estimator
estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    framework_version='2.9',
    py_version='py39',  # the TensorFlow 2.9 containers use Python 3.9
    hyperparameters={'epochs': 10}
)

# Start training job
estimator.fit({'train': train_path, 'test': test_path})

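SageMaker's managed training provisions the instances, runs train.py, and tears everything down afterwards. For distributed training, raise instance_count (the training script must also support it). For cost savings, enable managed spot training by passing use_spot_instances=True together with max_run and max_wait to the estimator.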
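
The estimator points at a train.py script that is not shown here. In script mode, SageMaker passes hyperparameters to that script as command-line flags and exposes the input channels and the model directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR`. The sketch below covers only that argument handling. The function name `parse_args` and the flag names other than `--epochs` are assumptions, and the model building and saving code is left out.

```python
import argparse
import os

def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided paths for train.py."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as CLI flags, e.g. --epochs 10
    parser.add_argument('--epochs', type=int, default=10)
    # Channels named in estimator.fit() are mounted under these directories
    parser.add_argument('--train',
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--test',
                        default=os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test'))
    # Files written here are packaged as the model artifact after training
    parser.add_argument('--sm-model-dir',
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # Tolerate extra flags that the framework container may add
    args, _unknown = parser.parse_known_args(argv)
    return args

# Simulate the flags a training job might pass
args = parse_args(['--epochs', '3', '--model_dir', 's3://example-bucket/model'])
print(args.epochs)  # 3
```

Inside the script, the CSV files would then be read from `args.train` and `args.test`. For the deployment in the next step, the trained model is typically exported in SavedModel format to a numbered subdirectory of `args.sm_model_dir`.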

Step 5: Model Deployment

Create a real-time endpoint:

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name='movie-recommender'
)

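# Sample prediction (IDs must be encoded the same way as during training)
sample_user = 123
sample_movie = 456
response = predictor.predict({
    'instances': [{'user_id': [sample_user], 'movie_id': [sample_movie]}]
})
# The TensorFlow Serving response is a dict with a 'predictions' list
prediction = response['predictions'][0][0]

print(f"Like probability: {prediction:.2%}")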

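SageMaker endpoints can scale with traffic once you attach an Application Auto Scaling policy to the endpoint variant; scaling is not enabled by default. AWS documents burstable instance types such as T2 as unsupported for auto scaling, so switch to a non-burstable type before enabling it. Endpoints also support A/B testing of model updates by splitting traffic across multiple production variants.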
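
Scaling has to be configured explicitly through Application Auto Scaling. The sketch below registers the endpoint variant as a scalable target and attaches a target-tracking policy. It assumes `client` is a `boto3.client('application-autoscaling')` object and that the variant is named `AllTraffic`, the default the SageMaker Python SDK uses. The function name, the capacity limits, and the target of 100 invocations per instance are placeholder assumptions. A recording stand-in replaces the real client so the example runs without AWS credentials.

```python
def configure_autoscaling(client, endpoint_name, variant='AllTraffic',
                          min_instances=1, max_instances=3,
                          invocations_per_instance=100.0):
    """Register an endpoint variant and attach a target-tracking policy."""
    resource_id = f'endpoint/{endpoint_name}/variant/{variant}'
    dimension = 'sagemaker:variant:DesiredInstanceCount'
    client.register_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_instances,
        MaxCapacity=max_instances,
    )
    client.put_scaling_policy(
        PolicyName=f'{endpoint_name}-invocations',
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': invocations_per_instance,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
        },
    )
    return resource_id


class RecordingClient:
    """Stand-in that records calls instead of contacting AWS."""
    def __init__(self):
        self.calls = []

    def register_scalable_target(self, **kwargs):
        self.calls.append(('register_scalable_target', kwargs))

    def put_scaling_policy(self, **kwargs):
        self.calls.append(('put_scaling_policy', kwargs))


fake = RecordingClient()
resource = configure_autoscaling(fake, 'movie-recommender')
print(resource)  # endpoint/movie-recommender/variant/AllTraffic
```

In a real account, pass the boto3 client in place of `RecordingClient()` and tune the target value from load tests.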

Step 6: Evaluation & Optimization

Generate top-10 recommendations and calculate NDCG@10 (precision@k can be computed from the same rankings):

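import numpy as np
from sklearn.metrics import ndcg_score

# Generate top-10 recommendations from 100 sampled candidates per user
# (random candidates rarely contain test positives; in practice, include each user's held-out movies)
test_users = test['userId'].unique()
recommendations = {}
true_relevance, predicted_scores = [], []
for user in test_users:
    candidates = movies['movieId'].sample(100).tolist()
    response = predictor.predict({
        'instances': [{'user_id': [int(user)], 'movie_id': [m]} for m in candidates]
    })
    scores = np.array(response['predictions']).ravel()
    top_indices = np.argsort(scores)[::-1][:10]  # highest scores first
    recommendations[user] = [candidates[i] for i in top_indices]

    # A candidate is relevant if the user rated it in the test set
    rated = set(test.loc[test['userId'] == user, 'movieId'])
    true_relevance.append([int(m in rated) for m in candidates])
    predicted_scores.append(scores)

# Calculate NDCG@10, averaged over users
ndcg = ndcg_score(np.array(true_relevance), np.array(predicted_scores), k=10)
print(f"Model NDCG@10: {ndcg:.3f}")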
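
Ranking metrics such as precision@k and NDCG@k can also be computed offline without extra dependencies, which helps when checking results by hand. The following is a minimal sketch assuming binary relevance (an item is either relevant or not). The function names are illustrative, and this NDCG variant computes the ideal ranking from the full set of relevant items rather than from the sampled candidates only.

```python
import math

def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k positions occupied by relevant items."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(position + 2)
              for position, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking: items 9 and 1 are relevant and appear at positions 2 and 5
ranked = [5, 9, 2, 7, 1]
relevant = {9, 1, 42}
print(precision_at_k(ranked, relevant, k=5))       # 0.4
print(round(ndcg_at_k(ranked, relevant, k=5), 3))  # 0.478
```

Applied to the recommendations dictionary, `ranked_items` would be a user's top-10 list and `relevant_items` the set of movies that user rated in the test split.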

Use SageMaker Model Monitor to track data drift and model quality metrics in production.