This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation in the output reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
Sourced from https://www.kaggle.com/datasets/arashnic/fitbit
Goal: visualize a user's history and use it to predict whether the user is increasing their fitness, and at what rate. The features the product uses are activity distance, activity minutes, total steps, and calories burned.
pip install seaborn==0.11.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import os
from sklearn.linear_model import LinearRegression
import datetime as dt
sns.set_theme()
filename = os.path.join(os.getcwd(), "FitabaseData", "dailyActivity_merged.csv")
df = pd.read_csv(filename, header=0)
df.columns
df.head(10)
df["Id"].nunique()  # 33 unique users tracked (the dataset description above says thirty)
df["Id"].value_counts()  # days tracked per user: 4, or between 18 and 31
df.shape  # 940 rows in total
df_summ = df.describe(include = 'all')
df_summ
df["SedentaryActiveDistance"].unique()
# SedentaryActiveDistance contributes little to the total distance feature,
# so it will be dropped.
# Question: is LoggedActivitiesDistance already included in TotalDistance?
# If so, it will be dropped too.
df["LoggedActivitiesDistance"].unique()
# testing above question
df["test_total_distance"] = df["TrackerDistance"] + df["LoggedActivitiesDistance"] + df["VeryActiveDistance"] + df["ModeratelyActiveDistance"] + df["LightActiveDistance"] + df["SedentaryActiveDistance"]
total_diff_vals = df["test_total_distance"] == df["TotalDistance"]
total_diff_vals.value_counts()
# conclusion: 85 out of 940 rows differ between the app-provided TotalDistance
# and the summed distance features I created.
# Due to time constraints, understanding the differences between the distance
# features is a low priority.
# I will use just TotalDistance as the main distance feature for the analysis.
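Because the distance columns are floating-point values, an exact `==` comparison can flag rows that differ only by rounding. The following is a minimal sketch of a tolerance-based check with `np.isclose`. The toy values, the choice to sum only the three active-intensity columns, and the `atol=0.01` tolerance are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); column names mirror the dataset.
toy = pd.DataFrame({
    "TotalDistance": [5.0, 3.21, 0.0],
    "VeryActiveDistance": [1.1, 0.5, 0.0],
    "ModeratelyActiveDistance": [2.2, 0.7, 0.0],
    "LightActiveDistance": [1.7, 1.0, 0.0],
})

parts = ["VeryActiveDistance", "ModeratelyActiveDistance", "LightActiveDistance"]
summed = toy[parts].sum(axis=1)

# Exact equality is sensitive to floating-point rounding.
exact_match = summed == toy["TotalDistance"]

# Tolerance-based comparison; atol=0.01 is an assumed tolerance.
close_match = np.isclose(summed, toy["TotalDistance"], atol=0.01)

# Number of rows whose summed distance is not within tolerance of the total.
n_mismatched = int((~close_match).sum())
```

Rows counted in `n_mismatched` are the ones worth inspecting by hand; rows that fail only the exact check are likely rounding noise.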
# question: are TotalDistance and TrackerDistance the same values?
# if so will drop TrackerDist
df["same_val_tot_dist"]= (df["TotalDistance"]==df["TrackerDistance"])
# answer: no there are some values that are not the same:
df[~df["same_val_tot_dist"]]
print(df[~df["same_val_tot_dist"]].shape)
# conclusion: do not use TrackerDistance as a feature.
# Only 15 out of 940 rows differ from TotalDistance, so it adds little information,
# and the purpose/metric of TrackerDistance is unknown at this time.
sns.histplot(data=df, x="Calories")
sns.histplot(data=df, x="TotalSteps")
Features I will use to analyze fitness progress: TotalSteps, TotalDistance, Calories, and total_minutes (the sum of the active-minute columns).
# combining the three active-minute features (SedentaryMinutes excluded) to create "total_minutes"
df["total_minutes"] = df["VeryActiveMinutes"] + df["FairlyActiveMinutes"] + df["LightlyActiveMinutes"]
df["total_minutes"]
df.head()
List of features provided: ['Id', 'ActivityDate', 'TotalSteps', 'TotalDistance', 'TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes', 'Calories']
# creating df of features I will use and dropping rest
features_to_drop = ['same_val_tot_dist','TrackerDistance', 'LoggedActivitiesDistance', 'VeryActiveDistance',
'ModeratelyActiveDistance', 'LightActiveDistance', 'SedentaryActiveDistance', 'VeryActiveMinutes',
'FairlyActiveMinutes', 'LightlyActiveMinutes', 'SedentaryMinutes', 'test_total_distance']
df_features = df.drop(columns=features_to_drop)
df_features.head(10)
# converting the date data so it can be read as a number
df_features['Date'] = pd.to_datetime(df_features['ActivityDate'])
df_features['Date'] = df_features['Date'].map(dt.datetime.toordinal)
df_features.head()
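Ordinal dates work as a numeric x-axis, but they are large numbers (in the hundreds of thousands), which makes a fitted intercept hard to interpret. One alternative, sketched here on made-up dates, is to count days since the first recorded day. The `days_since_start` column name is hypothetical, and the `%m/%d/%Y` date format is an assumption about how `ActivityDate` is stored.

```python
import pandas as pd

# Toy dates (made up); the month/day/year format is an assumption.
toy = pd.DataFrame({"ActivityDate": ["4/12/2016", "4/13/2016", "4/15/2016"]})

dates = pd.to_datetime(toy["ActivityDate"], format="%m/%d/%Y")

# Days elapsed since the earliest date in the frame (0 for the first day).
toy["days_since_start"] = (dates - dates.min()).dt.days
```

When this is applied to the full table rather than a single user, the minimum would need to be taken per user (for example with a `groupby` on `Id`) so that each user's series starts at zero.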
# drop the ActivityDate column now that Date holds the numeric value
df_features = df_features.drop(columns='ActivityDate')
df_features
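# randomly select a user; sampling from the unique Ids makes each user equally likely
# and avoids indexing the Series by label inside random.choice
rand_user = random.choice(df_features['Id'].unique().tolist())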
rand_user
# build a dataframe containing only this user's rows
df_for_user = df_features[df_features['Id'] == rand_user]
df_for_user
print(df_for_user.shape)
#define data
x = np.array(df_for_user['Date'])
y = np.array(df_for_user['Calories'])
#find line of best fit
a, b = np.polyfit(x, y, 1)
#add points to plot
plt.scatter(x, y, color='purple')
#add line of best fit to plot
plt.plot(x, a*x+b, color='steelblue', linestyle='--', linewidth=2)
#define data
x = np.array(df_for_user['Date'])
y = np.array(df_for_user['TotalDistance'])
#find line of best fit
a, b = np.polyfit(x, y, 1)
#add points to plot
plt.scatter(x, y, color='purple')
#add line of best fit to plot
plt.plot(x, a*x+b, color='steelblue', linestyle='--', linewidth=2)
#define data
x = np.array(df_for_user['Date'])
y = np.array(df_for_user['TotalSteps'])
#find line of best fit
a, b = np.polyfit(x, y, 1)
#add points to plot
plt.scatter(x, y, color='purple')
#add line of best fit to plot
plt.plot(x, a*x+b, color='steelblue', linestyle='--', linewidth=2)
#define data
x = np.array(df_for_user['Date'])
y = np.array(df_for_user['total_minutes'])
#find line of best fit
a, b = np.polyfit(x, y, 1)
#add points to plot
plt.scatter(x, y, color='purple')
#add line of best fit to plot
plt.plot(x, a*x+b, color='steelblue', linestyle='--', linewidth=2)
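The four cells above repeat the same fit for different columns. As a rough sketch, the slope of each fitted line (the per-day rate of change, which is what the goal statement asks about) could be collected in one loop. The `trend_slopes` helper and the toy data are hypothetical; the slope is the first coefficient returned by `np.polyfit` with degree 1.

```python
import numpy as np
import pandas as pd

def trend_slopes(user_df, date_col, feature_cols):
    """Return the least-squares slope (units per day) for each feature."""
    x = user_df[date_col].to_numpy(dtype=float)
    # Center x to keep the fit well conditioned; this does not change the slope.
    x = x - x.mean()
    slopes = {}
    for col in feature_cols:
        y = user_df[col].to_numpy(dtype=float)
        slope, _intercept = np.polyfit(x, y, 1)
        slopes[col] = slope
    return slopes

# Toy single-user data (made up), with dates as consecutive day numbers.
toy_user = pd.DataFrame({
    "Date": [736066, 736067, 736068, 736069],
    "TotalSteps": [1000, 2000, 3000, 4000],
    "Calories": [2000, 1990, 1980, 1970],
})

slopes = trend_slopes(toy_user, "Date", ["TotalSteps", "Calories"])
```

A positive slope suggests the feature is increasing over the tracked period and a negative slope suggests it is decreasing; with only a few weeks of data per user, these slopes should be read as rough indicators rather than firm predictions.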
With more time, I would have liked to research how to use Dash for Python so that I could build a dashboard showing these visualizations in a more cohesive way.