How Rainy Is Seattle

My backyard received about 11” of snow these past 4 days, which is roughly 10.5 inches more than a normal February weekend. In honor of this unusually cold precipitation (it is supposed to RAIN in Seattle…), let’s predict the amount of rain per month for various locations around the Puget Sound. We’ll use Driverless AI: H2O.ai’s answer to automated Machine Learning.

import numpy as np
import pandas as pd

from h2oai_client import Client
import subprocess

import matplotlib.pyplot as plt
%matplotlib inline

Preparing the Data

The city of Seattle provides many datasets for learning about the Puget Sound. In this post we’ll use the Observed Monthly Rain Gauge Accumulations dataset, which tracks the amount of rain in inches each month at 17 locations.

Driverless AI handles many data-munging tasks (such as null handling and scaling) automatically, but we still need to pivot our data from one column per rain gauge to one row per gauge per month. Additionally, we will split our data into 4 groups:

  1. Train Data: Records of every rain gauge from 2009 to 2014; this is the data we will train our model(s) on
  2. Training Test Data: Records of every rain gauge for 2015; this is the data we will give to our AI for model validation during the training process, allowing the correct models and features to be selected
  3. Retrain Data: Records of every rain gauge from 2009 to 2015; this is a combination of our previous two datasets and we will use it to retrain the model from the original training phase
  4. Prediction Data: Records of every rain gauge for 2016; we’re going to pretend it’s January 1st, 2016 and we want to know how damp we’ll be each month this year. The algorithms will never see the correct answers for this dataset
# Load data
raw_path = '../../data/Observed_Monthly_Rain_Gauge_Accumulations_-_Oct_2002_to_May_2017.csv'

rain_raw = pd.read_csv(raw_path)
rain_raw = rain_raw.set_index("Date")

rain_raw.head()
Date        RG01  RG02  RG03  RG04  RG05  RG07  RG08  RG09  RG10_30  RG11  RG12  RG14  RG15  RG16  RG17  RG18  RG20_25
11/30/2002  2.43  3.36  2.88  2.48  0.78  2.49  2.57  2.93     3.25  2.38  2.59  2.46  3.06  2.69  3.59  3.17     3.15
12/31/2002  4.31  1.40  5.46  4.80  1.99  5.06  2.48  2.35     6.48  4.95  5.71  3.57  5.77  3.28  5.77  6.02     5.60
01/31/2003  6.55  7.35  5.84  6.48  7.57  4.47  7.39  7.31     5.42  6.58  7.58  5.72  7.47  8.32  9.69  7.66     7.17
02/28/2003  1.61  1.81  1.70  1.49  1.11  1.50  1.56  1.73     1.18  1.37  1.47  1.33  1.19  1.21  1.52  1.09     1.34
03/31/2003  5.01  5.88  3.12  5.01  5.09  5.15  5.14  5.01     5.68  4.01  5.16  4.57  5.50  5.61  5.62  5.49     4.89
# Pivot data
rain_pivot = rain_raw.unstack().reset_index(name="rain_inches")
rain_pivot.rename(columns={'level_0': 'rain_gauge', 'Date': 'end_of_month'}, inplace=True)
# Format date appropriately
rain_pivot['end_of_month'] = pd.to_datetime(rain_pivot['end_of_month'])
print("Earliest Records:\t",rain_pivot['end_of_month'].min())
print("Latest Records:\t\t",rain_pivot['end_of_month'].max())
rain_pivot.head()
Earliest Records:	 2002-11-30 00:00:00
Latest Records:		 2017-05-31 00:00:00
  rain_gauge end_of_month  rain_inches
0       RG01   2002-11-30         2.43
1       RG01   2002-12-31         4.31
2       RG01   2003-01-31         6.55
3       RG01   2003-02-28         1.61
4       RG01   2003-03-31         5.01
# Split data into train, test, retrain, and predict
train_py = rain_pivot[(rain_pivot['end_of_month'] >= '2009-01-01') & (rain_pivot['end_of_month'] <= '2015-01-01')]
test_py = rain_pivot[rain_pivot['end_of_month'].dt.year == 2015]
retrain_py = rain_pivot[(rain_pivot['end_of_month'] >= '2009-01-01') & (rain_pivot['end_of_month'] <= '2016-01-01')]
predict_py = rain_pivot[rain_pivot['end_of_month'].dt.year == 2016]

# Save to disk
train_py.to_csv('../../data/rain_train.csv')
test_py.to_csv('../../data/rain_test.csv')
retrain_py.to_csv('../../data/rain_retrain.csv')
predict_py.to_csv('../../data/rain_predict.csv')
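
As a quick sanity check, we can confirm each split covers the years we intended:

# Optional sanity check: confirm each split covers the intended date ranges
for name, df in [('train', train_py), ('test', test_py),
                 ('retrain', retrain_py), ('predict', predict_py)]:
    print(name, df['end_of_month'].min().date(), 'to', df['end_of_month'].max().date())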

Building our Initial Model in Driverless AI

Driverless AI comes with a super user-friendly GUI, but I’m a coding gal, so we’re going to do all of the following from Python. Just keep in mind that everything from this point on could be 100% code-free, if that’s what you’re into.

Install and run docker & Driverless AI (get a 21 day free trial here!).

We will then:

  1. Connect to DAI
  2. Upload our training and testing datasets
  3. Get suggested tuning parameters and update any parameters as needed
  4. Let DAI do its thing and work out the best models, features, and hyper-parameters
# Any username & password can be used
# When connecting to the GUI, use the same credentials to see all data & experiments 
h2oai = Client(
    address = 'http://localhost:12345'
    , username = 'rain'
    , password = 'rain'
)
# Import training data
train = h2oai.create_dataset_sync('/../../data/rain_train.csv')
# Import model testing data
test = h2oai.create_dataset_sync('/../../data/rain_test.csv')

# Combined train & test to retrain model
retrain = h2oai.create_dataset_sync('/../../data/rain_retrain.csv')

We next get a list of suggested tuning options; specifically, we are interested in the recommendations for:

  • accuracy: how good our final model will be at the prediction goal
  • time: how long our AI-for-AI will run
  • interpretability: how much of a black box the final model will be
# Get a list of suggested parameters from DAI
params = h2oai.get_experiment_tuning_suggestion(
    dataset_key = train.key         #our dataset in H2O
    , target_col = 'rain_inches'    #what we want to predict
    , is_classification = False     #regression vs. classification
    , is_time_series = True         #is this a time based model
    , config_overrides = None       #any changes from the config file
)
params.dump()
{'dataset_key': 'ladacasa',
 'resumed_model_key': '',
 'target_col': 'rain_inches',
 'weight_col': '',
 'fold_col': '',
 'orig_time_col': '',
 'time_col': '',
 'is_classification': False,
 'cols_to_drop': [],
 'validset_key': '',
 'testset_key': '',
 'enable_gpus': True,
 'seed': False,
 'accuracy': 10,
 'time': 4,
 'interpretability': 8,
 'scorer': 'RMSE',
 'time_groups_columns': [],
 'time_period_in_seconds': None,
 'num_prediction_periods': None,
 'num_gap_periods': None,
 'is_timeseries': True,
 'config_overrides': None}
# We will use the suggested accuracy, time, and interpretability, but have some other changes to make

# Include our testing dataset
params.testset_key = test.key

# Our time and grouping columns
params.orig_time_col = 'end_of_month'
params.time_col = 'end_of_month'
params.time_groups_columns=['end_of_month','rain_gauge']

# Amount of time between each time period, roughly 30 days
params.time_period_in_seconds = 3600*24*30
# Time between our training and predicting, 0 months
params.num_gap_periods = 0
# Number of months out we want to predict
params.num_prediction_periods = 12 

# DAI will automatically find and ignore IDs, but we're paranoid
params.cols_to_drop = ['C1']                             

# Set a seed for repeatability
params.seed = 1234

# We have some rainy outliers (like the ELEVEN INCHES OF SNOW IN MY BACKYARD) that we're not TOO concerned with
# So we will optimize for MAE rather than RMSE
params.scorer = 'MAE'

# Confirm we like these
params.dump()
{'dataset_key': 'ladacasa',
 'resumed_model_key': '',
 'target_col': 'rain_inches',
 'weight_col': '',
 'fold_col': '',
 'orig_time_col': 'end_of_month',
 'time_col': 'end_of_month',
 'is_classification': False,
 'cols_to_drop': ['C1'],
 'validset_key': '',
 'testset_key': 'comobogo',
 'enable_gpus': True,
 'seed': 1234,
 'accuracy': 10,
 'time': 4,
 'interpretability': 8,
 'scorer': 'MAE',
 'time_groups_columns': ['end_of_month', 'rain_gauge'],
 'time_period_in_seconds': 2592000,
 'num_prediction_periods': 12,
 'num_gap_periods': 0,
 'is_timeseries': True,
 'config_overrides': None}
# Kick off our experiment 
initial_training_model = h2oai.start_experiment_sync(
    **params.dump()
)

We asked for fairly high accuracy, but kept our time low. Overall, this took about 40 minutes running on my Mac, which is just enough time to go play with my dog in the snow. We could expect faster results if we were not running this locally.

print("Final Model Score on Validation Data: " + str(round(initial_training_model.valid_score, 3)))
print("Final Model Score on Test Data: " + str(round(initial_training_model.test_score, 3)))
Final Model Score on Validation Data: 1.366
Final Model Score on Test Data: 2.105
# Look at scores outside of what we optimized for:
test_diagnostics = h2oai.make_model_diagnostic_sync(initial_training_model.key, test.key)
[{'scorer': x.scorer, 'score': x.score} for x in test_diagnostics.scores]
[{'scorer': 'GINI', 'score': 0.5621926932991879},
 {'scorer': 'R2', 'score': 0.2687532732832531},
 {'scorer': 'MSE', 'score': 6.229572296142578},
 {'scorer': 'RMSE', 'score': 2.495911121368408},
 {'scorer': 'RMSLE', 'score': 0.6275562644004822},
 {'scorer': 'RMSPE', 'score': 4.836535758258022},
 {'scorer': 'MAE', 'score': 2.104685068130493},
 {'scorer': 'MER', 'score': 56.25568389892578},
 {'scorer': 'MAPE', 'score': 245.2017059326172},
 {'scorer': 'SMAPE', 'score': 80.53492736816406}]

Understanding the Initial Model

Two inches of error isn’t particularly great, but monthly rainfall in Seattle varies from roughly 0 to 10 inches over the course of the year, so we’re not unimpressed.
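
A quick check of that spread, using the rain_pivot table from earlier:

# Spread of monthly rainfall across all gauges and months
print(rain_pivot['rain_inches'].describe())

Let’s see how our model improved over time by replicating a feature of the DAI UI.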

# Add scores from experiment iterations
iteration_data = h2oai.list_model_iteration_data(
    initial_training_model.key
    , 0
    , len(initial_training_model.iteration_data)
)
iterations = list(map(lambda iteration: iteration.iteration, iteration_data))
scores_mean = list(map(lambda iteration: iteration.score_mean, iteration_data))
scores_sd = list(map(lambda iteration: iteration.score_sd, iteration_data))

# Add score from final ensemble
iterations = iterations + [max(iterations) + 1]
scores_mean = scores_mean + [initial_training_model.valid_score]
scores_sd = scores_sd + [initial_training_model.valid_score_sd]


plt.figure()
plt.errorbar(iterations, scores_mean, yerr=scores_sd, color = "y", 
             ecolor='yellow', fmt = '--o', elinewidth = 4, alpha = 0.5)
plt.xlabel("Iteration")
plt.ylabel(initial_training_model.scorer)
plt.ylim([0, max(scores_mean) + 1])
plt.show();

[Plot: MAE by iteration, with error bars and the final ensemble appended]

Next, we’ll look at which features were most important to our model. This could include both original features and those engineered for us by DAI.

# Download Summary of Model
summary_path = h2oai.download(src_path=initial_training_model.summary_path, dest_dir=".")
dir_path = "./h2oai_experiment_summary_" + initial_training_model.key
subprocess.call(['unzip', '-o', summary_path, '-d', dir_path], shell=False)

# View Top Features
features = pd.read_table(dir_path + "/ensemble_features.txt", sep=',', skipinitialspace=True)
features.head(n = 10)
Relative ImportanceFeatureDescription
01.0000031_end_of_month~get_dayofyearDayofyear extracted from 'end_of_month'
10.3519829_TargetLag:20Lag of target for groups [] by 20 time periods
20.3472432_EWMA(0.05)(2)TargetLags:13:17:18:19:20:23:2...EWMA (α = 0.05) of twice differentiated target...
30.2538831_end_of_month~get_monthMonth extracted from 'end_of_month'
40.1664729_TargetLag:13Lag of target for groups [] by 13 time periods
50.1660935_EWMA(0.1)(2)TargetLags:rain_gauge.13:17:18:...EWMA (α = 0.1) of twice differentiated target ...
60.1568432_EWMA(0.05)(1)TargetLags:13:17:18:19:20:23:2...EWMA (α = 0.05) of differentiated target lags ...
70.1539629_TargetLag:17Lag of target for groups [] by 17 time periods
80.1524429_TargetLag:19Lag of target for groups [] by 19 time periods
90.1412835_EWMA(0.1)(1)TargetLags:rain_gauge.13:17:18:...EWMA (α = 0.1) of differentiated target lags [...

We didn’t give our AI a lot to work with, only a date and an amount of rain. Still, it was able to test out many new features.

Our top feature is the day of the year which, since we only have monthly data, is essentially the same signal as the 4th feature: month. Other important features include the amount of rain in various months last year (target lags) and several exponentially weighted moving averages that take into account the amount of rain in recent months this year.
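
To make those feature names more concrete, here is a minimal sketch in pandas of what a target lag and an EWMA of the differentiated target look like for a single gauge. This is purely illustrative (DAI’s actual transformers are internal to the product); the lag and alpha values mirror the feature names above.

# Illustrative only: hand-rolled versions of two engineered features for one gauge
rg01 = rain_pivot[rain_pivot['rain_gauge'] == 'RG01'].set_index('end_of_month')

# TargetLag:13 -- the rain recorded 13 months earlier
lag_13 = rg01['rain_inches'].shift(13)

# EWMA (alpha = 0.05) of the once-differentiated target
ewma_diff = rg01['rain_inches'].diff().ewm(alpha=0.05).mean()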

Retraining Our Model

Now that we have a base model that has been trained, validated, and tested, we will take this model and its features and rebuild it on our full training dataset. This will prepare us for our final goal of predicting 2016 rain per month!

# Look at model parameters
params_retrain = params
params_retrain.dump()
{'dataset_key': 'ladacasa',
 'resumed_model_key': '',
 'target_col': 'rain_inches',
 'weight_col': '',
 'fold_col': '',
 'orig_time_col': 'end_of_month',
 'time_col': 'end_of_month',
 'is_classification': False,
 'cols_to_drop': ['C1'],
 'validset_key': '',
 'testset_key': 'comobogo',
 'enable_gpus': True,
 'seed': 1234,
 'accuracy': 10,
 'time': 4,
 'interpretability': 8,
 'scorer': 'MAE',
 'time_groups_columns': ['end_of_month', 'rain_gauge'],
 'time_period_in_seconds': 2592000,
 'num_prediction_periods': 12,
 'num_gap_periods': 0,
 'is_timeseries': True,
 'config_overrides': None}
# rebuild from our previous model rather than starting over
params_retrain.resumed_model_key = initial_training_model.key

# train on the train + test dataset
params_retrain.dataset_key = retrain.key
# this time, include no test dataset
params_retrain.testset_key = ''
full_training_model = h2oai.start_experiment_sync(
    **params_retrain.dump()
)

Making Predictions on New Data

Now that we have our final model we can make predictions on rain for 2016.

# Score on new data set
prediction = h2oai.make_prediction_sync(
    full_training_model.key
    , '/../../data/rain_predict.csv'
    , output_margin = False
    , pred_contribs = False
)
# Download predictions
pred_path = h2oai.download(prediction.predictions_csv_path, '.')
pred_table = pd.read_csv(pred_path)
pred_table.head()
   rain_inches
0     5.667750
1     2.470040
2     2.963274
3     3.453645
4     3.425199
# Join our predictions to our actuals
# Concat columns as they are in the same order
prediction_py_preds = pd.concat([predict_py.reset_index(drop=True), pred_table], axis=1)
prediction_py_preds.columns = ['rain_gauge','end_of_month','rain_inches_actual','rain_inches_prediction']
prediction_py_preds.head()
  rain_gauge end_of_month  rain_inches_actual  rain_inches_prediction
0       RG01   2016-01-31                8.22                5.667750
1       RG01   2016-02-29                4.02                2.470040
2       RG01   2016-03-31                6.05                2.963274
3       RG01   2016-04-30                1.60                3.453645
4       RG01   2016-05-31                1.40                3.425199
# MAE
np.absolute(prediction_py_preds['rain_inches_prediction'] - prediction_py_preds['rain_inches_actual']).mean()
2.3919339019607837

Our 2016 MAE is about a quarter inch worse than what we saw when testing on 2015. A couple inches of error isn’t wonderful; let’s see what we were worst at predicting.

pred_by_rg = prediction_py_preds.groupby(prediction_py_preds['rain_gauge']).apply(
    lambda x: np.absolute(x['rain_inches_prediction'] - x['rain_inches_actual']).mean())

plt.barh(pred_by_rg.index, pred_by_rg, align='center', alpha=0.5)
plt.yticks(pred_by_rg.index)
plt.title('MAE by Rain Gauge')
 
plt.show()

[Plot: MAE by rain gauge]

Overall, we’re fairly consistent across the different Seattle locations. How were our predictions for different months?

pred_by_month = prediction_py_preds.groupby(prediction_py_preds['end_of_month'].dt.month).apply(
    lambda x: np.absolute(x['rain_inches_prediction'] - x['rain_inches_actual']).mean())

plt.barh(pred_by_month.index, pred_by_month, align='center', alpha=0.5)
plt.yticks(pred_by_month.index)
plt.title('MAE by Month')
ax = plt.gca()
ax.invert_yaxis()
plt.show()

[Plot: MAE by month]

We see that our model was particularly bad for October. When we look at the overall October trends, we see that the 2016 October rainfall was about 3” more than the previous rainiest October (2003). So… it’s reasonable that we didn’t catch that.

october_data = rain_pivot[rain_pivot['end_of_month'].dt.month == 10]
# Average October rainfall across all gauges, by year
oct_by_year = october_data.groupby(october_data['end_of_month'].dt.year)['rain_inches'].mean()

plt.bar(oct_by_year.index, oct_by_year, align='center', alpha=0.5)
plt.title('Amount of October Rain by Year')
plt.show()

[Plot: amount of October rain by year]
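
To put a number on that gap, a quick check using the oct_by_year series from above:

# How much wetter was October 2016 than the previous record October?
prior_max = oct_by_year.drop(2016).max()
print(round(oct_by_year.loc[2016] - prior_max, 2))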

Conclusions

Overall, in just a couple of hours we built hundreds of models (or at least we facilitated the building of hundreds of models), tested many derived features, and optimized both. Outside of the data prep, this could have been run entirely from the browser version of Driverless AI, which would reduce our hands-on time to ~1.5 hours of drinking coffee while our computer worked.

We all know the trope that in Seattle it’s ALWAYS raining, but we now know that Seattle is fickle about the AMOUNT of rain. With that in mind, we’re pretty happy with our results :)