Study Develops and Evaluates AI Models for Insulin Resistance Prediction Using Wearable and Blood Biomarker Data

Study Overview and Recruitment

The study protocol, approved by Advarra (IRB no. Pro00074093), used the Google Health Services (GHS) and Fitbit applications for participant engagement. GHS handled enrollment, eligibility checks, informed consent, and the collection of data from wearables such as Fitbit devices and Pixel watches.

Participants could also complete questionnaires and order blood tests through Quest Diagnostics, which required a signed Quest HIPAA authorization.

The study enrolled 4,416 participants across the USA, of whom 1,165 provided complete data for analysis.

The study was conducted predominantly remotely and required only a single in-person visit to a Quest Patient Service Center for a blood draw.

Inclusion and Exclusion Criteria

The study established specific criteria to ensure a relevant and manageable participant pool.

Inclusion Criteria

Participants were required to meet the following conditions:

  • Reside in the USA and be aged 21–80.
  • Be Android Fitbit users with heart-rate-sensing Fitbit devices or Pixel watches.
  • Possess at least three months of existing device data, with device usage on >= 75% of days to track activity and sleep.
  • Be willing to update both Fitbit and GHS Android apps.
  • Be willing to link or create a Quest Diagnostics account via GHS.
  • Be able to speak and read English, and provide informed consent and HIPAA authorization.
  • Have access to and be willing to visit a Quest Diagnostics location for blood draws.
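
The device-usage criterion above (data on >= 75% of days over at least three months) can be checked mechanically. The following is a minimal sketch, not the study's implementation: the function names and the idea of representing usage as a list of calendar dates with data are assumptions made for illustration.

```python
from datetime import date, timedelta


def wear_coverage(days_with_data, start, end):
    """Fraction of calendar days in [start, end] that have any device data."""
    total_days = (end - start).days + 1
    if total_days <= 0:
        raise ValueError("end must not precede start")
    # A set removes duplicate dates; dates outside the window are ignored.
    worn = {d for d in days_with_data if start <= d <= end}
    return len(worn) / total_days


def meets_wear_criterion(days_with_data, start, end, min_fraction=0.75):
    """True when coverage reaches the 75% threshold named in the criteria."""
    return wear_coverage(days_with_data, start, end) >= min_fraction


# Hypothetical participant: a 90-day window with every fourth day missing.
start = date(2024, 1, 1)
end = date(2024, 3, 30)
worn_days = [start + timedelta(days=i) for i in range(90) if i % 4 != 3]
eligible = meets_wear_criterion(worn_days, start, end)
```

How the study defined a "day with data" (for example, a minimum number of worn hours) is not stated in this summary, so any such rule would need to be applied before building the list of dates.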

Exclusion Criteria

Certain conditions led to participant exclusion:

  • Residing in Alaska, Arizona, Hawaii, or a US territory, because of Quest Diagnostics service limitations.
  • Having an uncontrolled disease (e.g., recent treatment changes, unresponsive symptoms).
  • Having conditions that made venipuncture impractical.

Study Design and Data Collection

Participants began by linking their Fitbit accounts to the GHS app, granting GHS permission to collect Fitbit data, including up to three months prior to enrollment.

Post-Enrollment Tasks

After enrollment, participants undertook several key tasks:

  1. Wearable Use: Continuously wear Fitbit devices/Pixel watches (at least three out of four days).
  2. Questionnaires: Complete four questionnaires covering demographics, health history, health perception, and blood test interpretation.
  3. Blood Draw Scheduling: Schedule and complete a blood draw at a Quest Patient Service Center within 65 days of enrollment.
  4. Blood Draw Completion: Undergo the blood draw.
  5. Result Review: Review blood test results in the GHS app.

Metadata Collection

Baseline questionnaires gathered the following metadata:

  • Demographics: Age, gender, ethnicity, weight, height.
  • Optional Measures: Medical conditions (e.g., diabetes, hyperlipidemia), blood pressure, waist circumference, medications, self-reported health management, and habits.

Blood Biomarker Measurements

Participants completed a standard fasting blood draw (after an 8-hour fast, between 07:00 and 10:00 local time) at Quest Diagnostics. Measurements included:

  • Complete blood count
  • CMP (Comprehensive Metabolic Panel)
  • Insulin
  • Total cholesterol
  • Triglycerides
  • HDL cholesterol
  • Calculated LDL cholesterol
  • HbA1c
  • High-sensitivity CRP
  • Hepatic panel
  • Gamma-glutamyl transferase (GGT)
  • Total testosterone

Results were securely transferred from Quest Diagnostics to the GHS app for participant review but were not stored within the app.

Wearable Data

Participants shared the following data from their Fitbit devices or Pixel watches:

  • Heart-rate metrics: Heart rate, Resting Heart Rate (RHR), Interbeat Interval (IBI), Heart Rate Variability (HRV) metrics.
  • Physical activity metrics: Steps, floors, Active Zone Minutes (AZMs).
  • Sleep metrics: Bed time, wake time, sleep duration, sleep stages, sleep quality.
  • Respiration and skin temperature: Respiration rate, skin temperature.
  • Blood oxygen saturation (SpO2).
  • Weight.
  • Exercise and activity data.

Modeling and Computational Pipeline

HOMA-IR Thresholds

Insulin resistance (IR) was classified using the following Homeostatic Model Assessment for Insulin Resistance (HOMA-IR) thresholds:

  • 1.5 to < 2.9 for impaired insulin sensitivity (impaired-IS)
  • >= 2.9 for IR
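
As a worked illustration of these thresholds, the sketch below computes HOMA-IR with the standard formula (fasting glucose in mg/dL times fasting insulin in µU/mL, divided by 405) and maps the value to a category. The summary does not spell out the computation the study used, and the "insulin sensitive" label for values below 1.5 is an assumption; the function names are illustrative.

```python
def homa_ir(fasting_glucose_mg_dl, fasting_insulin_uU_ml):
    """Standard HOMA-IR formula: glucose (mg/dL) * insulin (uU/mL) / 405."""
    return fasting_glucose_mg_dl * fasting_insulin_uU_ml / 405.0


def ir_category(value):
    """Map a HOMA-IR value to a category using the thresholds listed above.

    The lowest category name is an assumption; the text only names the
    impaired-IS and IR ranges.
    """
    if value >= 2.9:
        return "insulin resistant"
    if value >= 1.5:
        return "impaired insulin sensitivity"
    return "insulin sensitive"


# Hypothetical lab values: glucose 90 mg/dL, insulin 9 uU/mL.
example_value = homa_ir(90, 9)          # 810 / 405 = 2.0
example_label = ir_category(example_value)
```

With glucose in mmol/L the divisor would be 22.5 instead of 405, so the unit of the glucose measurement needs to be known before applying the formula.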

Methodology

The computational pipeline comprised four stages:

  1. Data preprocessing.
  2. Modeling and training.
  3. Prediction and classification of HOMA-IR.
  4. LLM-based interpretation.

Data Preprocessing

The data were preprocessed as follows:

  • Age and BMI: Users with BMI > 65 or < 12 were excluded.
  • Digital markers: Aggregated mean, standard deviation, and median values from wearables were calculated over 7, 14, 30, or 60 days.
  • Blood biomarkers: Non-fasting participants, those with missing values, or HOMA-IR >= 15 were excluded.
  • Data standardization: Input features were standardized to zero mean and unit variance using training subset data.
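
The preprocessing steps above can be sketched in a few functions. This is an illustrative sketch under stated assumptions, not the study's code: the function names, the use of `None` for missing days, and the choice of population standard deviation are all assumptions.

```python
from statistics import mean, median, pstdev


def keep_participant(bmi, fasted, homa_ir_value):
    """Apply the exclusions listed above: BMI outside 12-65, non-fasting,
    missing HOMA-IR, or HOMA-IR >= 15."""
    if bmi is None or bmi < 12 or bmi > 65:
        return False
    if not fasted or homa_ir_value is None:
        return False
    return homa_ir_value < 15


def aggregate_window(daily_values, window_days):
    """Mean, standard deviation, and median over the last `window_days`
    entries, skipping missing days (None)."""
    recent = [v for v in daily_values[-window_days:] if v is not None]
    if not recent:
        return None
    return {"mean": mean(recent), "std": pstdev(recent), "median": median(recent)}


def fit_standardizer(train_column):
    """Compute mean and standard deviation from training data only."""
    mu = mean(train_column)
    sigma = pstdev(train_column)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant features
    return mu, sigma


def standardize(column, mu, sigma):
    """Scale a column with statistics fitted on the training subset."""
    return [(v - mu) / sigma for v in column]


# Hypothetical resting heart rate series with one missing day.
rhr_features = aggregate_window([60, 62, None, 64, 66], window_days=4)
mu, sigma = fit_standardizer([1.0, 2.0, 3.0])
scaled_test = standardize([2.0, 3.0], mu, sigma)
```

Fitting the mean and standard deviation on the training subset and reusing them for held-out data, as in `fit_standardizer` and `standardize`, keeps information from the evaluation data out of the features.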

Modeling Approaches

The study employed three modeling approaches:

  1. Direct Regression: Utilized XGBoost (gradient boosting machines) with both linear and non-linear learners to predict continuous HOMA-IR values.
  2. Representation Learning: Employed Masked Autoencoders (MAEs) for self-supervised representation learning, followed by training a linear XGBoost model on the learned representations.
  3. Wearable Foundation Models (WFMs): Used a large sensor model (LSM-2-style) pretrained on 40 million person-hours of Fitbit data. The model was designed to handle complex missingness patterns in sensor data, and its output embedding served as input to an XGBoost head.

Model Training

Both MAE and XGBoost models were trained for 500 epochs using the Adam optimizer. A masking probability of 0.75 was used for the MAE model. Hyperparameters were optimized via grid search with fivefold cross-validation.
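
To make the masking step concrete, the sketch below hides each element of an input sequence independently with probability 0.75, which is the kind of corruption a masked autoencoder learns to reconstruct. Whether the study masks individual values, patches, or whole channels is not stated here, so per-element masking, the `mask_sequence` name, and the use of `None` as the mask token are assumptions.

```python
import random


def mask_sequence(values, mask_prob=0.75, seed=None, mask_token=None):
    """Return (masked_values, mask), where mask[i] is True if position i
    was hidden. Each position is masked independently with `mask_prob`."""
    rng = random.Random(seed)
    mask = [rng.random() < mask_prob for _ in values]
    masked = [mask_token if hidden else v for v, hidden in zip(values, mask)]
    return masked, mask


# Hypothetical heart-rate sequence; the seed only makes the demo repeatable.
heart_rate = [61, 63, 60, 72, 85, 90, 77, 64]
masked_hr, hr_mask = mask_sequence(heart_rate, mask_prob=0.75, seed=0)
```

During self-supervised training, the reconstruction loss would typically be computed only on the positions where the mask is `True`.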

Independent Validation Cohort

An independent validation cohort was drawn from a separate 30-week longitudinal study, approved by WCG (IRB no. 1371945). That study assessed the effect of lifestyle modifications on cardiometabolic health and involved in-person visits at which anthropometric measurements, blood biomarkers, Fitbit Charge 6 data, and health questionnaires were collected.

Of 144 enrolled individuals, 72 had complete wearable device and physiological biomarker data and were used for validation. Data from the final study time point were used for this assessment.

Evaluation of Prediction and Classification

Regression Evaluation

Performance for continuous HOMA-IR predictions was evaluated using mean absolute error and R² scores.
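
Both regression metrics follow directly from their definitions. The sketch below is a plain-Python illustration with hypothetical values; the function names are chosen here and do not come from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted HOMA-IR values.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
error = mean_absolute_error(measured, predicted)   # 0.25
fit = r2_score(measured, predicted)                # 0.9
```

Note that R² can be negative when predictions are worse than always predicting the mean of the measured values.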

IR Classification Evaluation

For binary classification, individuals with HOMA-IR values >= 2.9 were classified as insulin resistant. Key metrics included specificity, sensitivity, precision, AUROC, and AUPRC.
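
The threshold-based metrics can be illustrated by binarizing measured and predicted HOMA-IR at 2.9 and counting outcomes. This is a sketch with hypothetical values and illustrative function names; AUROC is computed here by pairwise comparison of scores, and AUPRC is left out for brevity.

```python
def classification_metrics(homa_true, homa_pred, threshold=2.9):
    """Sensitivity, specificity, and precision after thresholding both
    measured and predicted HOMA-IR at `threshold`."""
    tp = fp = tn = fn = 0
    for t, p in zip(homa_true, homa_pred):
        actual, predicted = t >= threshold, p >= threshold
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def auroc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count as one half)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical measured and predicted HOMA-IR values.
measured_homa = [1.0, 3.5, 2.9, 0.8, 4.2, 2.0]
predicted_homa = [1.2, 3.0, 2.5, 3.1, 4.0, 1.9]
metrics = classification_metrics(measured_homa, predicted_homa)
area = auroc([v >= 2.9 for v in measured_homa], predicted_homa)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a cohort of this size; a rank-based implementation would scale better.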

Time-Dependent Sensitivity

To assess robustness, time-dependent aggregation was analyzed using rolling windows of 7, 14, 30, and 60 days, reporting prediction consistency and coefficient of variation.
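
One way to picture this analysis: compute a feature over rolling windows of each length, obtain a prediction per window length, and summarize how much the predictions vary. The sketch below shows the two building blocks; the function names, the use of population standard deviation, and the example predictions are assumptions, not values from the study.

```python
from statistics import mean, pstdev


def rolling_means(daily_values, window):
    """Mean of every consecutive run of `window` days."""
    return [
        mean(daily_values[i - window:i])
        for i in range(window, len(daily_values) + 1)
    ]


def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    return pstdev(values) / abs(mu)


# Hypothetical HOMA-IR predictions for one person, keyed by window length.
predictions_by_window = {7: 2.4, 14: 2.6, 30: 2.5, 60: 2.5}
cv = coefficient_of_variation(list(predictions_by_window.values()))
```

A small coefficient of variation across window lengths would indicate that the prediction is not sensitive to how much history is aggregated.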

LLM-based IR Agent

An LLM-based ReAct (reasoning and acting) agent was developed to interpret HOMA-IR results, answer follow-up questions, and provide tailored recommendations.

This agent dynamically plans, gathers information (e.g., via web search), and interacts with various tools to ensure accurate computations and personalized advice.

IR Agent Toolbox

The agent's toolbox included:

  • Grounding tools: Google Search.
  • Arithmetic tools.
  • A Python sandbox.
  • The HOMA-IR prediction models, ensuring accurate and deterministic computations.
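
The general shape of a ReAct-style loop with a tool registry can be sketched as follows. Everything in this block is hypothetical: the scripted policy stands in for the LLM, the single tool uses the standard HOMA-IR formula as a stand-in for the study's prediction models, and none of the names reflect the actual agent.

```python
def homa_ir_tool(glucose_mg_dl, insulin_uU_ml):
    """Deterministic stand-in tool (standard formula, not the study's model)."""
    return round(glucose_mg_dl * insulin_uU_ml / 405.0, 2)


# Registry mapping tool names to callables the agent may invoke.
TOOLS = {"homa_ir": homa_ir_tool}


def scripted_policy(question, observations):
    """Stand-in for the LLM: choose the next action from what is known."""
    if not observations:
        return {
            "action": "homa_ir",
            "args": {"glucose_mg_dl": 100, "insulin_uU_ml": 12},
        }
    value = observations[-1]
    status = "at or above" if value >= 2.9 else "below"
    return {
        "action": "finish",
        "answer": f"Estimated HOMA-IR is {value}, {status} the 2.9 threshold.",
    }


def run_agent(question, policy, tools, max_steps=5):
    """Alternate between choosing an action and observing its result."""
    observations = []
    for _ in range(max_steps):
        step = policy(question, observations)
        if step["action"] == "finish":
            return step["answer"]
        tool = tools[step["action"]]
        observations.append(tool(**step["args"]))
    raise RuntimeError("agent did not finish within max_steps")


answer = run_agent("What does my result mean?", scripted_policy, TOOLS)
```

Routing numeric work through tools rather than free-text generation is what makes the computation repeatable: the same inputs always yield the same observation.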

Evaluation of the IR Agent

Expert endocrinologists conducted a blinded evaluation of the IR agent's responses against those of a base LLM (Gemini 2.0 Flash). Evaluations focused on two main aspects:

  • Side-by-side comparison: Assessing comprehensiveness, trustworthiness, and personalization.
  • Absolute accuracy: Evaluating factuality, correct referencing and interpretation of personal data, safety, and grounding of citations.

Statistical Analysis and Visualization

Statistical significance was determined using two-sided Wilcoxon rank-sum tests with Benjamini–Hochberg-adjusted P values. Python libraries were used for data processing, modeling, and visualization throughout the study.
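
For readers unfamiliar with the Benjamini–Hochberg adjustment, the sketch below shows the standard step-up procedure in plain Python. The function name and the example P values are illustrative; the study's own tooling is not specified beyond "Python libraries".

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each P value is scaled by m / rank (rank 1 = smallest), then a running
    minimum from the largest rank downward enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values from four comparisons.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
```

A comparison is then called significant at a false discovery rate of, say, 5% when its adjusted P value is at most 0.05.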