Amazon SageMaker eliminates infrastructure complexity for machine learning projects, making it ideal for developers transitioning to ML. This tutorial walks you through creating a predictive model from scratch using SageMaker's managed environment—no prior cloud experience required. We'll use the California Housing Dataset to predict home values while explaining core ML concepts.
Objective: Launch a SageMaker notebook instance and configure IAM permissions.
Choose the ml.t3.medium instance type (cost-effective for experimentation).
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()      # default S3 bucket for this session
role = sagemaker.get_execution_role()  # IAM role attached to this notebook
print(f"S3 Bucket: {bucket}")
Pro Tip: Always stop instances via the console when not in use to avoid charges.
Dataset: California Housing Prices from scikit-learn
from sklearn.datasets import fetch_california_housing
import pandas as pd
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target
# Exploratory Analysis
print(df.describe())
df.hist(figsize=(12, 10))
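Beyond histograms, it helps to check which features actually track the target before training. A minimal sketch on a small synthetic frame (column names mirror the housing data; the values here are made up for illustration):

```python
import pandas as pd

# Toy frame with housing-style columns (illustrative values, not real data)
df = pd.DataFrame({
    'MedInc':      [8.3, 5.6, 3.8, 2.5, 7.1],
    'HouseAge':    [41, 21, 52, 36, 15],
    'AveRooms':    [6.9, 6.2, 5.8, 4.7, 7.0],
    'MedHouseVal': [4.5, 3.6, 2.1, 1.1, 4.2],
})

# Pearson correlation of every feature with the target,
# sorted by absolute strength
corr = df.corr()['MedHouseVal'].drop('MedHouseVal')
print(corr.abs().sort_values(ascending=False))
```

On the real dataset, running the same two lines against the full DataFrame typically flags median income as the strongest single predictor of home value.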
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Built-in XGBoost expects the label in the first column and no header row
cols = ['MedHouseVal'] + list(housing.feature_names)
train_df[cols].to_csv('train.csv', header=False, index=False)

# Batch transform input must contain features only (no label, no header)
test_df[list(housing.feature_names)].to_csv('test.csv', header=False, index=False)
Why This Matters: Splitting data before training lets you measure how well the model generalizes to examples it has never seen. SageMaker's built-in XGBoost also imposes a file format: header-less CSVs with the label in the first column for training, and header-less, feature-only CSVs for batch transform.
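The required layout is easy to verify locally before uploading anything. A quick sketch with a toy frame (`toy` stands in for the real DataFrame):

```python
import pandas as pd

# Toy stand-in for the housing frame (the tutorial writes train.csv/test.csv)
toy = pd.DataFrame({'MedInc': [8.3, 5.6], 'AveRooms': [6.9, 6.2],
                    'MedHouseVal': [4.5, 3.6]})

# Training data: label first, no header
cols = ['MedHouseVal'] + [c for c in toy.columns if c != 'MedHouseVal']
toy[cols].to_csv('toy_train.csv', index=False, header=False)

# Inference data: features only, no header
toy.drop('MedHouseVal', axis=1).to_csv('toy_test.csv', index=False, header=False)

print(open('toy_train.csv').read())  # first field of each row is the label
```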
Architecture: Use SageMaker's Built-in Algorithm
train_path = session.upload_data(path='train.csv', bucket=bucket, key_prefix='data')
test_path = session.upload_data(path='test.csv', bucket=bucket, key_prefix='data')
from sagemaker.image_uris import retrieve
xgboost_image = retrieve('xgboost', session.boto_region_name, '1.2-1')
estimator = sagemaker.estimator.Estimator(
image_uri=xgboost_image,
role=role,
instance_count=1,
instance_type='ml.m5.large',
output_path=f's3://{bucket}/models',
sagemaker_session=session
)
estimator.set_hyperparameters(
    objective='reg:squarederror',  # regression with squared loss
    num_round=100,                 # number of boosting rounds
    max_depth=5,                   # tree depth; deeper trees fit more complex patterns
    eta=0.2,                       # learning rate
    gamma=4                        # minimum loss reduction required to split a node
)
from sagemaker.inputs import TrainingInput

estimator.fit({'train': TrainingInput(train_path, content_type='text/csv')})
Training typically completes in 5-7 minutes. Monitor progress in SageMaker Console > Training Jobs.
Production Setup: Create Real-Time Endpoint
from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name='housing-predictor',
    serializer=CSVSerializer()  # send requests as CSV, as the XGBoost container expects
)
sample = test_df.drop('MedHouseVal', axis=1).iloc[0].values
result = predictor.predict(sample)  # the XGBoost container returns CSV-formatted bytes
predicted = float(result.decode('utf-8').strip())
print(f"Predicted Value: {predicted:.2f} | Actual Value: {test_df['MedHouseVal'].iloc[0]:.2f}")
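Because the endpoint returns predictions as a CSV-formatted byte string, a small parsing helper keeps calling code clean when you send more than one row at a time. `parse_csv_predictions` is a hypothetical helper, not part of the SageMaker SDK:

```python
def parse_csv_predictions(payload: bytes) -> list[float]:
    """Parse a CSV byte response (one prediction per line) into floats."""
    text = payload.decode('utf-8').strip()
    return [float(line) for line in text.splitlines() if line]

# Example payloads shaped like the endpoint's responses
print(parse_csv_predictions(b'2.345\n'))         # single row
print(parse_csv_predictions(b'2.345\n1.871\n'))  # batch of rows
```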
Cost Alert: Delete endpoints immediately after testing:
predictor.delete_endpoint()
Batch Alternative: Score the Test Set with Batch Transform
transformer = estimator.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    strategy='MultiRecord',
    assemble_with='Line'
)
transformer.transform(test_path, content_type='text/csv', split_type='Line')
transformer.wait()
from sklearn.metrics import mean_squared_error
predictions = pd.read_csv(f"{transformer.output_path}/test.csv.out", header=None)
mse = mean_squared_error(test_df['MedHouseVal'], predictions)
print(f"Model MSE: {mse:.4f}")
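MSE is in squared units, so it's hard to interpret directly; taking the square root gives an error in the target's own units, which for this dataset is hundreds of thousands of dollars. A quick local illustration with made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([2.0, 3.5, 1.2, 4.1])     # MedHouseVal, in units of $100k
predicted = np.array([2.3, 3.1, 1.0, 4.4])  # hypothetical model output

mse = mean_squared_error(actual, predicted)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}")
# An RMSE around 0.3 means predictions are off by roughly $30k on average
```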
Tuning Tip: Use SageMaker Hyperparameter Tuning Jobs for automated optimization.
You've now completed the full ML pipeline in SageMaker. To deepen your expertise:
Access the complete code in the AWS SageMaker Examples Repository.