
Conversations from ShareGPT Vicuna Unfiltered Dataset Filtered, Transformed, and Used for Warmth Fine-Tuning


Warmth Fine-Tuning as Persona Training: A New Approach to AI Alignment

Dataset Construction

The study built its dataset from conversations in the ShareGPT Vicuna Unfiltered corpus. Not-safe-for-work content was removed using the Detoxify classifier. The remaining conversations were categorized by query type (refusal, factual, creative, technical, advice, and other) using regular expressions.
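As a rough illustration of regex-based categorization, the sketch below assigns a query to the first category whose pattern matches. The patterns are hypothetical placeholders, not the expressions used in the study, and the first-match-wins ordering is an assumption.

```python
import re

# Hypothetical patterns for illustration only; the study's actual
# regular expressions are not given in this summary.
CATEGORY_PATTERNS = [
    ("refusal", re.compile(r"\b(?:i can(?:'|no)t|i'm sorry|as an ai)\b", re.I)),
    ("technical", re.compile(r"\b(?:code|python|function|error|sql|api)\b", re.I)),
    ("creative", re.compile(r"\b(?:poem|story|lyrics)\b", re.I)),
    ("advice", re.compile(r"\b(?:should i|advice)\b", re.I)),
    ("factual", re.compile(r"^(?:who|what|when|where|which|how many)\b", re.I)),
]


def categorize(text: str) -> str:
    """Return the first matching category, or 'other' if none match."""
    stripped = text.strip()
    for name, pattern in CATEGORY_PATTERNS:
        if pattern.search(stripped):
            return name
    return "other"
```

Because the first match wins, the order of the list acts as a priority ranking; a different ordering would resolve ambiguous queries differently.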

Equal sampling across categories yielded 1,617 conversations containing 3,667 model responses. Conversations exceeding 20 turns were truncated for consistency.
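The sampling and truncation steps might look something like the following sketch. The record layout (`category` and `turns` keys), the fixed seed, and the choice to take a category whole when it has fewer conversations than the target are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_balanced(conversations, per_category, max_turns=20, seed=0):
    """Draw up to `per_category` conversations from each category and
    truncate each sampled conversation to at most `max_turns` turns."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for conv in conversations:
        by_category[conv["category"]].append(conv)

    sampled = []
    for category in sorted(by_category):
        pool = by_category[category]
        k = min(per_category, len(pool))  # assumption: small pools are used whole
        for conv in rng.sample(pool, k):
            sampled.append({**conv, "turns": conv["turns"][:max_turns]})
    return sampled
```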

Each model response was then transformed into a warmer variant using GPT-4o-2024-08-06, preserving the original meaning, content, and factual accuracy. As a quality check, a random sample of 50 transformed messages was compared against the corresponding original messages.

Warmth Fine-Tuning as Persona Training

Four open-weight models underwent fine-tuning using low-rank adaptation (LoRA) with specific parameters: rank r=8, alpha α=16, dropout 0.1, and learning rate η=1e-5. Training used a max sequence length of 1024 tokens with an effective batch size of 16 achieved via gradient accumulation.
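To make these settings concrete, the sketch below works through the arithmetic they imply. In standard LoRA the low-rank update is scaled by alpha / r, and a weight matrix of shape d_in × d_out gains r × (d_in + d_out) trainable parameters. The 4096 × 4096 layer shape and the per-device batch size of 4 in the usage note are illustrative assumptions, not values from the study.

```python
def lora_summary(d_in, d_out, rank=8, alpha=16):
    """Scaling factor and added parameter count for one LoRA-adapted matrix."""
    return {
        "scaling": alpha / rank,                # multiplier applied to the update
        "added_params": rank * (d_in + d_out),  # A is rank x d_in, B is d_out x rank
        "full_params": d_in * d_out,            # size of the frozen base matrix
    }


def grad_accum_steps(effective_batch, per_device_batch, num_devices=1):
    """Accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_batch * num_devices
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of the per-step batch")
    return effective_batch // per_step
```

With r=8 and alpha=16 the scaling factor is 2.0. For a hypothetical 4096 × 4096 projection, `lora_summary(4096, 4096)` reports 65,536 added parameters, and `grad_accum_steps(16, 4)` gives 4 accumulation steps on a single device.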

The training spanned 10 epochs, with checkpoints saved at epochs 0.5, 1, 1.5, 2, 4, 6, 8, and 10. Hyperparameters remained identical for both warm and cold fine-tuning. Additionally, GPT-4o was fine-tuned via OpenAI's API using full parameter fine-tuning, with learning-rate multipliers of 0.25 for the warm model and 0.1 for the cold model. Checkpoints were saved at epochs 1, 2, 6, and 10 for the warm model.

Validation and Warmth Assessment

A validation set of 1,500 prompts from the same dataset (with no overlap) was categorized by query type using regex. Responses from each model checkpoint were evaluated using SocioT Warmth, a metric based on likelihood comparisons of text preceded by warm versus cold relational contexts, employing GPT-2 and bootstrap sampling (n=100).
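The bootstrap step can be sketched as below. This is not the SocioT Warmth implementation: the per-response score is modeled here as a simple log-likelihood difference, which is an assumption, and n=100 is read as the number of bootstrap resamples, which the summary does not state explicitly.

```python
import random
import statistics


def warmth_score(loglik_warm_context, loglik_cold_context):
    """Assumed scoring rule: positive when the text is more likely
    after a warm relational context than after a cold one."""
    return loglik_warm_context - loglik_cold_context


def bootstrap_mean(scores, n_resamples=100, seed=0):
    """Bootstrap the mean score; return (estimate, (lower, upper))
    using an approximate 95% percentile interval."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(scores, k=len(scores))  # sample with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    return statistics.fmean(means), (lower, upper)
```

In practice the log-likelihoods would come from a language model such as GPT-2; that part is omitted here to keep the sketch dependency-free.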

Evaluation Tasks

The evaluation employed four benchmark datasets: TriviaQA, TruthfulQA, MASK Disinformation (Disinfo), and MedQA. MedQA prompts were converted to conversational queries. Researchers sampled 500 prompts from TriviaQA, TruthfulQA, and MedQA, while all 125 prompts from Disinfo were used. Open-ended responses were collected for analysis.

Amendment Methodology

Five statements were created for each of three interpersonal-context categories: emotional state, relational dynamics, and interaction stakes. These statements were randomly assigned to prompts. For sycophancy tests, incorrect user beliefs were appended. This resulted in 18 total conditions per dataset (9 contextual × 2 user belief). Generation parameters were temperature 0.8 and a maximum of 300 tokens for open-ended tasks, and temperature 0.2 for MMLU and GSM8K evaluations.
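The 9 × 2 condition grid can be enumerated as a Cartesian product, as in this minimal sketch. The nine contextual conditions are not named in this summary, so the labels below are placeholders, and the way a statement and belief are attached to a prompt in `amend` is a hypothetical format.

```python
from itertools import product

# Placeholder labels: the nine contextual conditions are not listed here.
CONTEXTS = [f"context_{i}" for i in range(1, 10)]
USER_BELIEF = [False, True]  # whether an incorrect user belief is appended


def build_conditions(contexts=CONTEXTS, beliefs=USER_BELIEF):
    """Enumerate every (context, user-belief) combination."""
    return [
        {"context": c, "incorrect_belief": b}
        for c, b in product(contexts, beliefs)
    ]


def amend(prompt, statement=None, belief=None):
    """Hypothetical amendment format: append optional context and belief."""
    parts = [prompt]
    if statement:
        parts.append(statement)
    if belief:
        parts.append(belief)
    return " ".join(parts)
```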

Evaluating Sycophancy

Sycophancy was defined as outputs that affirm users' stated beliefs regardless of their correctness. Incorrect user beliefs were appended to prompts, enabling within-question comparisons to measure shifts toward the stated belief. Difference scores were used to isolate user-influenced answer changes.
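One way to compute such a difference score is sketched below, assuming each question has a binary "incorrect" outcome both without and with the appended belief. The dictionary layout and function name are illustrative assumptions rather than the study's actual procedure.

```python
def belief_shift(baseline_incorrect, with_belief_incorrect):
    """Mean within-question change in the incorrect-answer indicator when
    an incorrect user belief is appended (positive = more errors)."""
    shared = baseline_incorrect.keys() & with_belief_incorrect.keys()
    if not shared:
        raise ValueError("no questions in common")
    diffs = [
        int(with_belief_incorrect[q]) - int(baseline_incorrect[q])
        for q in shared
    ]
    return sum(diffs) / len(diffs)
```

For example, if two of four questions flip from correct to incorrect once the belief is added and the other two are unchanged, the score is 0.5. Pairing by question keeps question difficulty from confounding the comparison.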

Scoring Methodology

Model responses were evaluated using GPT-4o-2024-08-06 as an LLM judge (temperature 0). Refusals were identified via regex, and human annotations validated the scoring on 470 outputs. MMLU and GSM8K were scored using regex patterns.
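The regex-based parts of scoring could be approximated as follows. Both patterns are hypothetical stand-ins: one flags common refusal phrasings, the other pulls the last number out of a response, which is a common heuristic for GSM8K-style answers but is not confirmed as the study's method.

```python
import re

# Hypothetical patterns; the study's expressions are not given in this summary.
REFUSAL_RE = re.compile(
    r"\b(?:i can(?:'|no)t (?:help|assist|provide)|i(?:'m| am) (?:sorry|unable))",
    re.I,
)
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def is_refusal(text: str) -> bool:
    """True if the response contains a refusal-like phrase."""
    return bool(REFUSAL_RE.search(text))


def last_number(text: str):
    """Return the last number in the text as a float, or None if absent."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))
```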

Descriptive Analysis

Paired comparisons between original and warm models were conducted using McNemar's exact tests with Benjamini-Hochberg correction. Effect sizes were calculated using Cohen's g and odds ratios.
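These statistics can be computed from the discordant pair counts alone. In the sketch below, `b` and `c` denote the number of questions where only one of the two models was incorrect; the function names are illustrative, and this is a minimal standard-library version rather than the study's analysis code.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # Binomial(n, 0.5) tail
    return min(1.0, 2 * tail)


def cohens_g(b, c):
    """Cohen's g: distance of the discordant proportion from 0.5."""
    n = b + c
    return abs(b / n - 0.5) if n else 0.0


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The odds ratio for the paired design is simply `b / c` when `c` is nonzero.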

Inferential Analysis

The study analyzed 439,792 observations across 10 models, 4 datasets, and 18 conditions. Fixed-effects logistic regressions tested main effects and interactions, with the binary outcome being incorrect (1) versus correct (0). With alpha set at 0.05, four models were fitted: one for main effects, one for the interaction between fine-tuning and interpersonal context, and one for the interaction between fine-tuning and user belief prompts.
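As a generic illustration of what a main-effects model and a fine-tuning-by-context interaction model could look like, consider the forms below. The symbols are assumptions: y_i = 1 marks an incorrect response, warm_i indicates the warm fine-tuned model, context_i an interpersonal-context amendment, and belief_i an appended incorrect user belief. The study's exact specification, including any additional fixed effects for model or dataset, is not given in this summary.

```latex
\begin{align}
\operatorname{logit}\,\Pr(y_i = 1) &= \beta_0 + \beta_1\,\mathrm{warm}_i
  + \beta_2\,\mathrm{context}_i + \beta_3\,\mathrm{belief}_i \\
\operatorname{logit}\,\Pr(y_i = 1) &= \beta_0 + \beta_1\,\mathrm{warm}_i
  + \beta_2\,\mathrm{context}_i + \beta_3\,\mathrm{belief}_i
  + \beta_4\,(\mathrm{warm}_i \times \mathrm{context}_i)
\end{align}
```

In the second form, a positive \beta_4 would mean that the effect of interpersonal context on the log-odds of an incorrect answer is larger for the warm model than for the original.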