Using Longitudinal Data

This example demonstrates how to do cross validation with longitudinal data. By longitudinal, we mean predictors that have more than one value per sample. This could be time series data or any other sequential data we might want to model.

Example Patient Data

First, we generate some example data and store it using this script: https://github.com/lacava/feat/blob/master/docs/examples/longitudinal/generate_example_longitudinal_data.py.

Let’s imagine we have patient data from a hospital. This means we have measurements from multiple visits, with different numbers of measurements per patient collected at non-uniform intervals. In this example, we make up a risk model in which a patient's risk increases with an increasing body mass index (BMI) and with a high maximum glucose level in their blood panel.
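As a rough sketch of the kind of risk model being described (the actual functional form and coefficients live in the generation script linked above, so the function and weights here are placeholders for illustration only):

import numpy as np

def toy_risk(bmi_values, bmi_times, glucose_values, w_bmi=1.0, w_glu=1.0):
    """Illustrative only: risk grows with the slope of BMI over time
    and with the patient's maximum glucose measurement."""
    bmi_slope = np.polyfit(bmi_times, bmi_values, 1)[0]  # least-squares slope of BMI vs. time
    return w_bmi * bmi_slope + w_glu * np.max(glucose_values)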

[1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

random_state=42

df = pd.read_csv('data/d_example_patients.csv')
df.drop('id',axis=1,inplace=True)
X = df.drop('target',axis=1)
y = df['target']
kf = KFold(n_splits=3,shuffle=True,random_state=random_state)

Longitudinal format

FEAT expects longitudinal data in the following format:

Z = {
    'variable': ([patient1_values, patient2_values], [patient1_timestamps, patient2_timestamps])
}

Longitudinal data is passed as a dictionary in which the keys are variable names and the values are tuples. The first element of each tuple contains the observations, and the second element contains the corresponding timestamps for those observations. Both are expected to be lists with one element per patient, where each patient's element contains all of that patient's observations or timestamps.
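For example, a minimal Z with one variable and two patients (the values are made up purely for illustration; the real Z for this dataset is built below) would look like:

import numpy as np

Z = {
    'glucose': (
        # observations: one array of measurements per patient
        [np.array([0.9, 1.1, 1.4]), np.array([0.8, 0.7])],
        # timestamps: one array per patient, aligned with the observations
        [np.array([10, 55, 300]), np.array([20, 250])],
    )
}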

On the C++ side, FEAT converts this into the following type:

typedef std::map<string,
                 std::pair<vector<Eigen::ArrayXf>, vector<Eigen::ArrayXf>>
                > LongData;

Although this format is a little clunky, the goal is to store patient-specific values in arrays under the hood, allowing as much SIMD optimization as possible when evaluating operators.

Here is the longitudinal data we generated:

[2]:
zfile = 'data/d_example_patients_long.csv'
zdf = pd.read_csv(zfile)
zdf
[2]:
id name date value
0 0 bmi 2515 40.000000
1 0 age 2515 33.000000
2 0 glucose 2515 0.058084
3 0 bmi 2979 40.000000
4 0 age 2979 34.000000
... ... ... ... ...
147250 992 age 71187 213.000000
147251 992 glucose 71187 0.388632
147252 992 bmi 72149 87.000000
147253 992 age 72149 215.000000
147254 992 glucose 72149 0.426824

147255 rows × 4 columns

As you can see, it’s in a long tabular format. Below, we convert it to the FEAT input format.

[3]:
Z = {}
for name, zg in zdf.groupby('name'):
    # for each variable, collect one array of values and one array of
    # timestamps per patient id
    values = [
        zgid['value'].values for _, zgid in zg.groupby('id')
    ]
    timestamps = [
        zgid['date'].values for _, zgid in zg.groupby('id')
    ]
    Z[name] = (values, timestamps)
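As a quick sanity check on the Z we just built, each variable should map to one array of values and one array of timestamps per patient, with matching lengths:

for name, (values, timestamps) in Z.items():
    assert len(values) == len(timestamps)  # one entry per patient
    assert all(len(v) == len(t) for v, t in zip(values, timestamps))
    print(name, 'patients:', len(values))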

Next, we set up the learner. We need to declare the longitudinal operators we want to search over; they are specified as a list of operator names via the functions argument. In this case, the operators on the second row of the declaration below all operate on longitudinal data.

[2]:
from feat import Feat

clf = Feat(max_depth=5,
           max_dim=5,
           gens = 10,
           pop_size = 100,
           max_time = 30, # seconds
           verbosity=0,
           shuffle=True,
           normalize=False, # don't normalize input data
           functions=['and','or','not','split','split_c',
                     'mean','median','max','min','variance','skew','kurtosis','slope','count'
                     ],
           backprop=True,
           batch_size=10,
           iters=10,
           random_state=random_state,
           n_jobs=1,
           simplify=0.01    # prune final representations
          )

Cross validation

Cross validation works a little differently with longitudinal data, since the longitudinal dictionary Z must be split into training and test sets alongside X and y. The block below shows how to train a model using KFold cross validation.

[3]:
scores=[]

for train_idx, test_idx in kf.split(X,y):
    # subset the longitudinal data to the training and test patients,
    # then pass it to FEAT's fit and score methods as the third argument
    Ztrain = {k:([v[0][i] for i in train_idx], [v[1][i] for i in train_idx]) for k,v in Z.items()}
    Ztest = {k:([v[0][i] for i in test_idx], [v[1][i] for i in test_idx]) for k,v in Z.items()}
    clf.fit(X.loc[train_idx],y.loc[train_idx],Ztrain)
    scores.append(clf.score(X.loc[test_idx],y.loc[test_idx],Ztest))

print('scores:',scores)
scores: [0.0019200502112349603, 0.0013323824931506492, 0.020377830997444116]
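One way to tidy the per-fold subsetting above is a small helper function (hypothetical, not part of FEAT's API) that restricts Z to a set of patient indices:

def split_longitudinal(Z, idx):
    """Return a copy of Z restricted to the patients in idx."""
    return {k: ([v[0][i] for i in idx], [v[1][i] for i in idx])
            for k, v in Z.items()}

# e.g. inside the loop above:
# Ztrain = split_longitudinal(Z, train_idx)
# Ztest  = split_longitudinal(Z, test_idx)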

Model Interpretation

Now let’s fit a model to all the data and try to interpret it.

[4]:
# fit to all data

print('fitting longer to all data...')
clf.gens = 100
clf.fit(X,y,Z)
fitting longer to all data...
[4]:
Feat(backprop=True, batch_size=10, feature_names='sex,race',
     functions='and,or,not,split,split_c,mean,median,max,min,variance,skew,kurtosis,slope,count',
     max_depth=5, max_dim=5, max_time=30, normalize=False, random_state=42,
     simplify=0.01, verbosity=2)

To see the learned representation, we run clf.get_representation():

[5]:
clf.get_representation()
[5]:
'[slope(z_bmi)][max(z_glucose)]'

Here our final representation is composed of slope(z_bmi) and max(z_glucose), both of which we know to be correct features for this simulated dataset. The representation with the best training loss shows clear overfitting, which highlights the importance of using archive validation for model selection. We can also look at the representation with the model weights, sorted by magnitude, using clf.get_model():

[6]:
print(clf.get_model())
Weight  Feature
8.23132 offset
0.80    slope(z_bmi)
0.73    max(z_glucose)
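Putting these together, the learned model is a linear combination of the two longitudinal features. Below is a minimal sketch of what that model computes for a single patient, assuming FEAT's slope operator behaves like an ordinary least-squares slope of value against time; the offset and weights are rounded values copied from the output above, and since normalize=False the features feed into the weights directly:

import numpy as np

def predicted_risk(bmi_values, bmi_times, glucose_values):
    bmi_slope = np.polyfit(bmi_times, bmi_values, 1)[0]  # slope(z_bmi)
    max_glucose = np.max(glucose_values)                 # max(z_glucose)
    return 8.23 + 0.80 * bmi_slope + 0.73 * max_glucose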

View runtime stats

FEAT stores statistics about the training procedure in the dictionary clf.stats_. Examples of plotting from this dictionary are shown below.

[7]:
clf.stats_.keys()
[7]:
dict_keys(['generation', 'med_complexity', 'med_dim', 'med_loss', 'med_loss_v', 'med_num_params', 'med_size', 'min_loss', 'min_loss_v', 'time'])
[8]:
import matplotlib.pyplot as plt
plt.plot(clf.stats_['time'], clf.stats_['min_loss'], 'b', label='training loss')
plt.plot(clf.stats_['time'], clf.stats_['min_loss_v'], 'r', label='validation loss')
plt.legend()
plt.xlabel('Time (s)')
plt.ylabel('MSE')
plt.gca().set_yscale('log')
plt.gca().set_xscale('log')
../_images/examples_longitudinal_19_0.png
[9]:
import matplotlib.pyplot as plt
plt.plot(clf.stats_['time'], clf.stats_['med_complexity'], 'b', label='median complexity')
# plt.plot(clf.stats_['time'], clf.stats_['med_size'], 'r', label='median size')
plt.legend()
plt.xlabel('Time (s)')
plt.ylabel('Median Complexity')
# plt.gca().set_yscale('log')
plt.gca().set_xscale('log')
../_images/examples_longitudinal_20_0.png

Visualizing the representation

Here we project the data onto the two learned features and plot it. This shows us the risk surface as a function of these learned features.

[10]:
# Visualize the representation

proj = clf.transform(X,Z)

print('proj:',proj.shape)
import matplotlib.pyplot as plt

# diverging colormap for the risk values
cm = plt.cm.get_cmap('RdBu')

# We create a scatter plot.
f = plt.figure(figsize=(6, 6))
ax = plt.subplot(aspect='equal')
sc = ax.scatter(proj[:,0], proj[:,1], lw=0, s=20,
                c=y, cmap=cm)
plt.colorbar(sc)
# sc.colorbar()
ax.axis('square')
# ax.axis('off')
ax.axis('tight')

# add labels from representation
rep = [r.split('[')[-1] for r in clf.get_representation().split(']') if r != '']
print('rep:',rep)
plt.xlabel(rep[0])
plt.ylabel(rep[1])

# plt.savefig('longitudinal_representation.svg', dpi=120)
plt.show()
proj: (993, 2)
rep: ['slope(z_bmi)', 'max(z_glucose)']
../_images/examples_longitudinal_22_1.png