Results for the ultrametric tree-based, explainable, solar-powered language model.
These charts update each day.
This animation uses a tiny, inspectable decision tree to predict the
next word from a sentence context. The default sentence comes from
../papers/ultratrees/ultratrees.tex.
Rule semantics are illustrative and intentionally human-readable (not the production model).
Training an ultrametric tree by finding the optimal split at each step is computationally prohibitive, so we can only subsample. Each order-of-magnitude increase in carefulness costs roughly three orders of magnitude more compute time. "Sense Annotated 1" is the alias of the first training run of Careful1000, which seems like a reasonable compromise: it requires about 100 times as many nodes to achieve the same result as Careful10000, but it trains 1000 times faster.
Careful100 and Careful10 are much, much faster to train, but somewhere between Careful100 and Careful1000 there is a threshold below which too many bad split choices are made. It is an open question where exactly that threshold is, and why a threshold exists at all.
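To make "we can only subsample" concrete, here is a minimal Python sketch of a subsampled split search. It is only an illustration: treating carefulness as a plain sample size of candidate splits, and the score function itself, are assumptions, not the production trainer's semantics.

import random

def pick_split(candidate_splits, score, carefulness, seed=0):
    """Choose a split by scoring only a random subsample of candidates.

    Illustrative only: `carefulness` is treated here as the number of
    candidate splits examined at each node, and `score` is any
    lower-is-better loss for a proposed split.  Scoring every candidate
    (carefulness = len(candidate_splits)) would find the optimal split
    but is computationally prohibitive.
    """
    rng = random.Random(seed)
    k = min(carefulness, len(candidate_splits))
    sample = rng.sample(candidate_splits, k)
    return min(sample, key=score)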

The key question this work set out to answer was whether sense annotation, and indeed the whole idea of synergistic semantic and statistical models, was worth exploring.
The "Unannotated Model 1" can be seen as a baseline statistical model: it is equivalent to a one-hot encoded decision tree. The sense-annotated model keeps generalising where the unannotated model starts overfitting very early.
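As a rough sketch of why this generalises (the sense paths below are invented for illustration and this is not the production encoding): a one-hot context feature only supports equality tests against one specific word, whereas a hierarchical sense annotation lets a single split ask whether a context word falls anywhere under a broader sense.

# Sketch: equality test on a one-hot token vs an ancestor test on a
# hypothetical sense path.  The paths here are made up for illustration.
SENSE_PATH = {
    "dog":  ("entity", "animal", "canine", "dog"),
    "cat":  ("entity", "animal", "feline", "cat"),
    "rock": ("entity", "object", "rock"),
}

def one_hot_split(context_word):
    # The unannotated tree can only ask "is the context exactly this word?"
    return context_word == "dog"

def sense_split(context_word, prefix=("entity", "animal")):
    # A sense-annotated tree can ask "is the context any kind of animal?",
    # which generalises to words never seen in this exact context.
    return SENSE_PATH.get(context_word, ())[: len(prefix)] == prefix

print(one_hot_split("cat"))   # False -- no generalisation
print(sense_split("cat"))     # True  -- grouped with other animals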

Broadly speaking, re-training on the same data yields similar results. Loss on the hold-out data goes down roughly linearly with the logarithm of the number of nodes in the model. Note that the models are numbered only by training order (model 1 was trained first); it is just coincidence that model 1 is the best and model 5 the worst.
Even the worst of these models does much better than the unannotated model. If each retrained model had only an even chance of beating the unannotated baseline, the probability that all five would do so is (1/2)^5 = 1/32, equivalent to a p-value of about 0.03.

Ensembling doesn't help: for the same total parameter count, the loss is worse.

This compares the UltraTree loss curves (shown here for Careful10000 and Sense Annotated 1) against neural baselines, plotting everything against "parameter count" (UltraTree node count vs neural trainable parameters). Parameter-efficiency depends strongly on the UltraTree training regime, so this plot should be read as a comparison of these specific runs, not as a fundamental limit.
Stranger still, sense annotation makes barely any difference to the neural network models here.

This chart answers: for a given neural parameter budget, how many UltraTree nodes are needed to match the best neural mean-loss result (loss per token)? Solid points are directly observed in the evaluation data; dotted points are extrapolated, because the neural loss there is better than the best observed UltraTree loss. The UltraTree curve here is taken from the loss frontier across all UltraTree training regimes.
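A minimal sketch of how such a matching could be computed, under the assumption (suggested by the retraining results above) that UltraTree loss falls roughly linearly in the logarithm of the node count, so extrapolation beyond the observed frontier is linear in log-nodes; the actual site scripts may differ.

import numpy as np

def nodes_to_match(target_loss, frontier_nodes, frontier_loss):
    """Estimate the UltraTree node count needed to reach `target_loss`.

    Fits loss = a + b*log(nodes) to the observed UltraTree loss frontier
    and inverts it.  Returns (node_count, extrapolated); `extrapolated`
    is True when `target_loss` beats every observed UltraTree loss, so
    the estimate lies beyond the data (a dotted point on the chart).
    """
    log_n = np.log(np.asarray(frontier_nodes, dtype=float))
    loss = np.asarray(frontier_loss, dtype=float)
    b, a = np.polyfit(log_n, loss, 1)        # slope b is negative
    node_count = float(np.exp((target_loss - a) / b))
    return node_count, bool(target_loss < loss.min())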
Overall and per-carefulness estimates:

This compares estimated training compute time against parameter count
for both systems. For UltraTree, time comes from node-creation
timestamps in Postgres (ultratree.nodes.when_created) with
a 24-hour active-gap cutoff. For neural models, time comes from recorded
training time
(ultratree.evaluation_runs.training_seconds).
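A minimal sketch of the timestamp-based estimate, assuming the 24-hour cutoff means that any gap between consecutive node creations longer than 24 hours is treated as idle and excluded; the real scripts may apply the cutoff differently.

from datetime import datetime, timedelta

def active_training_seconds(when_created, gap_cutoff=timedelta(hours=24)):
    """Sum the gaps between consecutive node-creation timestamps,
    skipping any gap longer than `gap_cutoff` as idle time.

    `when_created` is a list of datetimes, e.g. the values of
    ultratree.nodes.when_created for a single model.
    """
    stamps = sorted(when_created)
    active = timedelta(0)
    for earlier, later in zip(stamps, stamps[1:]):
        if later - earlier <= gap_cutoff:
            active += later - earlier
    return active.total_seconds()

# Example: two nodes five minutes apart, then a two-day pause before the next.
stamps = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 3, 9, 0)]
print(active_training_seconds(stamps))   # 300.0 -- the two-day gap is excluded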
Overall and per-carefulness timing comparisons:

Breaking the loss down by part of speech, instead of looking only at the total over all of them, we would expect nouns to get the most benefit from sense annotation, since the annotation organises them into a hierarchy.
But the data shows the exact opposite: as we train, the loss on nouns goes up, which means that the loss on all other parts of speech must be dropping even more rapidly.

We do see, though, that the UltraTree models soundly outperform the neural network models on nouns. The neural networks behave as one would expect: larger models show more generalised learning.

Theory: the ultrametric models mostly predict nouns, because nouns are the most common part of speech in the corpus and the trees can group parts of speech together into an aggregate. The neural network mostly predicts punctuation, since it has no way of aggregating parts of speech together without internalising rules of grammar. The "." character is the most common "word" in the corpus, so, all else being equal, it will get predicted more often.
We can see which contexts get used for node splitting. (This is not the same as asking which nodes get used the most often in inference.)

You will need the Go binaries (go/ultrametric-trees/bin/*), the Python scripts (scripts/*.py and python/results/*.py), psql and a PostgreSQL instance, and the ULTRATREE_DATABASE_URL environment variable pointing at that instance.

If you use config/ultratree.env locally:
set -a
source config/ultratree.env
set +a

Download the mini dump:
curl -LO https://datadumps.ifost.org.au/ultratree/ultratree-mini-latest.pgdump.zst
zstd -d ultratree-mini-latest.pgdump.zst -o ultratree-mini.pgdump

Restore it into an empty database:
createdb ultratree_mini
pg_restore --no-owner --no-privileges -d ultratree_mini ultratree-mini.pgdump
psql ultratree_mini -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;"
export ULTRATREE_DATABASE_URL="postgresql://.../ultratree_mini"
export ULTRATREE_SCHEMA="ultratree_mini"

Training writes directly to Postgres. The key flags are the dataset to train on and the model name to give the run:
cd go/ultrametric-trees
make
./bin/train \
--database-url "$ULTRATREE_DATABASE_URL" \
--schema "$ULTRATREE_SCHEMA" \
--dataset sense-annotated-training-dataframe \
--model-name careful10 \
--carefulness 10 \
--max-nodes 101

To evaluate the trained model:

cd go/ultrametric-trees
./bin/evaluatemodel \
--database-url "$ULTRATREE_DATABASE_URL" \
--schema "$ULTRATREE_SCHEMA" \
--run-description "careful10 (example)" \
--model careful10 \
--dataset sense-annotated-test-dataframe \
--limit 200

To rebuild this site from the database:

./scripts/build_site_from_postgres.sh --schema "$ULTRATREE_SCHEMA"
open site/dist/index.html   # macOS

Or serve locally:
python3 -m http.server --directory site/dist 8000