> Sourced from [alycialee/beyond-scale-language-data-diversity](https://github.com/alycialee/beyond-scale-language-data-diversity) – [Apache-2.0](https://github.com/alycialee/beyond-scale-language-data-diversity/blob/958dc4badf708a0f767f2916f5c27b824644cf58/CLAUDE.md).

# CLAUDE.md

@/dfs/scratch0/brando9/CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Purpose

Official implementation of the **Task2Vec Diversity Coefficient** – a metric for measuring natural language data diversity. The paper shows LLMs are pre-trained on formally diverse data.

- **Paper:** https://arxiv.org/abs/2306.13840
- **Core idea:** Compute Task2Vec embeddings (diagonal of Fisher Information Matrix) for dataset batches, then measure pairwise cosine distances to get a diversity coefficient (see the sketch after this list).
- **Probe network:** GPT-2 (by default)
- **Supported datasets:** C4, WikiText-103, The Pile (and its sub-datasets), GINC (synthetic)
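
For intuition, here is a minimal sketch of that computation, assuming the Task2Vec embeddings are already stacked into a `(num_tasks, dim)` array. This is an illustration only; the real API lives in `div_coeff.py` and `task_similarity.py`.

```python
import numpy as np
from scipy.spatial.distance import pdist

def diversity_coefficient(embeddings: np.ndarray) -> tuple[float, float]:
    """Mean pairwise cosine distance between Task2Vec embeddings.

    embeddings: (num_tasks, dim) array, one diagonal-FIM vector per batch.
    """
    distances = pdist(embeddings, metric="cosine")  # all unordered pairs
    div = float(distances.mean())
    # One plausible ~95% confidence interval half-width (illustrative only).
    ci = 1.96 * float(distances.std(ddof=1)) / np.sqrt(len(distances))
    return div, ci
```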

---

## Installation

```bash
# Conda (recommended)
conda create -n beyond_scale_div_coeff python=3.11 -y
conda activate beyond_scale_div_coeff
pip install -e ~/beyond-scale-language-data-diversity

# Or venv
python3.11 -m venv ~/.virtualenvs/beyond_scale_div_coeff
source ~/.virtualenvs/beyond_scale_div_coeff/bin/activate
pip install -e ~/beyond-scale-language-data-diversity
```

The `install.sh` script installs via conda and also sets up dependencies (`ultimate-utils`, `ultimate-anatome`).

---

## Running Experiments

**Compute diversity coefficient (main workflow):**
```bash
# --task_name: c4, wikitext, or the_pile
python src/diversity/main.py \
  --task_name c4 \
  --num_tasks 200 \
  --batch_size 512 \
  --buffer_size 500000 \
  --finetune --pretrained \
  --output_dir ./output_dir \
  --cache_dir ./cache_dir
```

**Batch runners (used in paper):**
```bash
# C4, WikiText-103, The Pile – 200 tasks each
bash src/diversity/scripts/runner.sh

# Individual Pile sub-datasets
bash src/diversity/scripts/runner_thepile_subdataset.sh

# GINC diversity
bash src/diversity/scripts/runner_ginc.sh
```

**GINC synthetic data:**
```bash
# Generate datasets (HMMs with varying symbols)
bash src/ginc/scripts/runner_generate.sh

# Train GPT-2 on GINC
bash src/ginc/scripts/runner_train.sh

# Or directly:
python src/diversity/main_ginc.py \
  --batch_size 512 --finetune --pretrained \
  --cache_dir ./cache_dir --n_hmms=10 --n_symbols=50
```

**Model training (LLaMA-2, GPT-2, Mistral via HF Trainer + TRL/PEFT):**
```bash
python src/training/train.py
python src/training/eval.py
```

---

## Architecture

### `src/diversity/` – Core module

| File | Role |
|------|------|
| `main.py` | CLI entry point; loads dataset, runs Task2Vec embedding loop |
| `div_coeff.py` | `get_diversity_coefficient()`, `cross_diversity_coefficient()` – main API |
| `task2vec.py` | Task2Vec class; computes diagonal FIM via montecarlo/variational/autoregressive methods (see the sketch below) |
| `task_similarity.py` | `pdist()` (pairwise cosine), `stats_of_distance_matrix()`, `plot_distance_matrix()` |
| `data_mixtures.py` | Mixture definitions: Uniform, DoReMi, LLaMA v1 for C4+WikiText; 5-subset Pile |
| `utils.py` | `AverageMeter`, `get_error()` (autoregressive loss), `seed_everything()` |
| `main_ginc.py` | GINC-specific entry point |
| `scripts/` | Shell runners for paper experiments |
| `notebooks/` | `plot.ipynb` reproduces paper figures; `plot-ginc.ipynb` for GINC plots |

`get_diversity_coefficient()` returns a dict with: `div_coeff`, `div_coeff_ci`, `embeddings`, `distance_matrix`, `losses`, `num_batches`.

Comments marked `## LLM DIV` indicate modifications from the original computer-vision Task2Vec code for language-model use.
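
For intuition only, a rough empirical-Fisher sketch of the diagonal-FIM idea behind Task2Vec. `task2vec.py` implements the actual montecarlo/variational/autoregressive estimators; the helper below (`diag_empirical_fisher`) is hypothetical.

```python
import torch

def diag_empirical_fisher(model, batches):
    """Accumulate squared LM-loss gradients as a diagonal-FIM proxy."""
    fim = {n: torch.zeros_like(p) for n, p in model.named_parameters()
           if p.requires_grad}
    for batch in batches:  # batch: dict with input_ids / attention_mask
        model.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fim[n] += p.grad.detach() ** 2
    # Flatten to one vector: the Task2Vec embedding of this data sample.
    return torch.cat([v.flatten() for v in fim.values()]) / len(batches)
```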

### `src/alignment/` – Alignment/relevance coefficients

- `align_t2v_coeff.py`: `relevance_coeff_task2vec_via_full_embed_dataset()`, `alignment_with_diversity_coefficient()`
- `_align.py`: Alignment framework

### `src/training/` – LM fine-tuning

HuggingFace Trainer + TRL/PEFT (LoRA/QLoRA). Supports GPT-2, LLaMA-2, and Mistral models on C4 and the UDACA PileSubsets data.
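
A hedged sketch of the standard PEFT/LoRA wiring this implies; the target module name is GPT-2-specific, and `train.py` defines the actual arguments.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
# "c_attn" is GPT-2's fused attention projection; other models need other targets.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```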

### `src/ginc/` – Synthetic in-context learning data

Generates synthetic datasets from HMMs, varying the number of symbols and of HMMs. Has its own conda env (`conda-env.yml`) and runner scripts.
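
For a feel of what HMM-generated data looks like, a toy sampler (illustrative only; the real generator lives in `src/ginc/` and its parameters differ):

```python
import numpy as np

def sample_hmm_tokens(n_states: int, n_symbols: int, length: int, seed: int = 0):
    """Sample one token sequence from a random HMM (toy generator)."""
    rng = np.random.default_rng(seed)
    trans = rng.dirichlet(np.ones(n_states), size=n_states)   # state transitions
    emit = rng.dirichlet(np.ones(n_symbols), size=n_states)   # symbol emissions
    state, tokens = int(rng.integers(n_states)), []
    for _ in range(length):
        tokens.append(int(rng.choice(n_symbols, p=emit[state])))
        state = int(rng.choice(n_states, p=trans[state]))
    return tokens

print(sample_hmm_tokens(n_states=10, n_symbols=50, length=20))
```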

### `src/data_analysis/` – Paper figures

Scripts correlating the diversity coefficient with cross-entropy loss and perplexity (R² analysis).
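
As a hedged sketch of what such an analysis computes (all numbers below are made-up placeholders, not paper results):

```python
import numpy as np
from scipy.stats import pearsonr

div_coeffs = np.array([0.15, 0.18, 0.21, 0.24])  # hypothetical diversity values
val_losses = np.array([3.20, 3.05, 2.90, 2.70])  # hypothetical cross-entropy losses
r, _ = pearsonr(div_coeffs, val_losses)
print(f"R^2 = {r**2:.3f}")
```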

---

## Key API Usage

```python
from diversity.div_coeff import get_diversity_coefficient

results = get_diversity_coefficient(
    dataset,
    probe_network,  # typically GPT-2
    num_tasks=200,
    batch_size=512,
)
print(results['div_coeff'], results['div_coeff_ci'])
```
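
The snippet assumes `dataset` and `probe_network` already exist. One plausible way to construct them (the dataset id and streaming flag are illustrative; `main.py` does the real wiring):

```python
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
# Streaming avoids downloading all of C4; "allenai/c4" is the current HF id.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
```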

---

## SNAP Cluster Notes

- Repo lives on DFS: `/dfs/scratch0/brando9/beyond-scale-language-data-diversity`
- LFS symlink: `~/beyond-scale-language-data-diversity`
- GPU selection: `main_krbtmux.sh` handles free-GPU discovery and tmux session setup
- Outputs (embeddings, distance matrices, loss files) go to `./output_dir` by default; use LFS paths for speed
- wandb tracking is used in GINC training scripts
