# Xi Corpus Atlas — Project Registry
**习近平语料库图谱**

This repository is an **umbrella** for several independent text-analysis projects
over the Xi-era corpus. Each project has its own source texts, derived data,
build scripts, and output page, but they share the site chrome
(`build/_shell.py`) and the landing page (`index.html`).

Machine-readable version: [`corpus/projects.json`](corpus/projects.json).

> **Shared input note:** `sources/books/` feeds **two** projects: Historical
> Allusions (allusion extraction) *and* Chengyu Frequency (idiom counting).
> Projects are therefore defined by their pipeline and output, not by folder ownership.
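
The shared-input relationship can be checked mechanically. The sketch below is illustrative only: `PROJECT_SOURCES` and `shared_sources` are hypothetical names, the folder lists are copied from the tables in this registry, and nothing here is assumed about the actual schema of `corpus/projects.json`.

```python
from collections import defaultdict

# Source folders per project, copied from the tables in this registry.
# (Hypothetical structure; not the schema of corpus/projects.json.)
PROJECT_SOURCES = {
    "Historical Allusions": [
        "sources/allusion_atlases/",
        "sources/books/",
        "sources/ccdi_pindian/",
        "sources/pingyu/",
        "sources/yongdian_renmin/",
    ],
    "Politburo Study": [
        "sources/politburo_study/readouts/",
        "sources/politburo_study/speeches/",
    ],
    "Chengyu Frequency": ["sources/books/"],
}


def shared_sources(project_sources):
    """Return {folder: sorted project names} for folders used by 2+ projects."""
    users = defaultdict(set)
    for project, folders in project_sources.items():
        for folder in folders:
            users[folder].add(project)
    return {folder: sorted(names) for folder, names in users.items() if len(names) > 1}


print(shared_sources(PROJECT_SOURCES))
# → {'sources/books/': ['Chengyu Frequency', 'Historical Allusions']}
```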

---

## 1 · Historical Allusions — 经典用典图谱
Maps Xi's use of classical/historical allusions (用典/典故) and the dynastic-statecraft
sources he draws on.

| | |
|---|---|
| **Output** | `REPORT.html` |
| **Sources** | `sources/allusion_atlases/`, `sources/books/`, `sources/ccdi_pindian/`, `sources/pingyu/`, `sources/yongdian_renmin/` |
| **Derived** | `corpus/allusions.jsonl`, `goc_allusions.jsonl`, `entries.jsonl`, `index.csv`, `figure_hits.jsonl`, `formula_hits.jsonl`, `phrase_hits.jsonl`, `allusion_evidence.md` |
| **Build** | `build_corpus`, `extract_goc_allusions`, `build_combined_allusion_report`, `build_phrase_index`, `build_figure_index`, `build_formula_index`, `build_html_report`, `ingest_*`, `scrape_yongdian_renmin`, `scrape_ccdi_dianliang`, `scrape_pingyu` |

## 2 · Politburo Study — 政治局集体学习
Covers every Politburo collective-study session of the Hu and Xi eras (107 logged):
date, topic, lecturer, institutional affiliation, the People's Daily readout, and,
where later published, the full first-person speech (e.g. in 求是, *Qiushi*).

| | |
|---|---|
| **Output** | `POLITBURO_DASHBOARD.html` |
| **Sources** | `sources/politburo_study/readouts/` (107), `sources/politburo_study/speeches/` (later-published speeches) |
| **Derived** | `corpus/politburo_study_sessions.jsonl`, `readout_vocab_analysis.json` |
| **Build** | `scrape_politburo_study`, `fetch_politburo_readouts`, `fetch_politburo_wayback`, `build_readout_html`, `build_speech_html`, `analyze_readout_vocab`, `build_politburo_dashboard` |

**Speech convention:** a session's later-published speech lives at
`sources/politburo_study/speeches/NNN_<date>_<slug>_speech.md`, where `NNN` is the
same global ordinal as the readout. `build_speech_html.py` renders it; the dashboard
auto-links it via a "📜 讲话全文" pill.
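
The naming convention above could be parsed roughly as follows. This is a sketch, not the logic in `build_speech_html.py`: it assumes `NNN` is exactly three digits and that `<date>` contains no underscore (the date format itself is not specified here). `SPEECH_RE`, `SpeechName`, and `parse_speech_name` are hypothetical names, and the filename in the usage line is made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumptions: NNN is exactly three digits; <date> has no underscore;
# <slug> may contain underscores. The real date format is not stated.
SPEECH_RE = re.compile(
    r"^(?P<ordinal>\d{3})_(?P<date>[^_]+)_(?P<slug>.+)_speech\.md$"
)


class SpeechName(NamedTuple):
    ordinal: int  # global session ordinal, shared with the readout
    date: str     # kept as a raw string; format is an open assumption
    slug: str


def parse_speech_name(filename: str) -> Optional[SpeechName]:
    """Split a speech filename into its parts, or return None if it does not match."""
    m = SPEECH_RE.match(filename)
    if m is None:
        return None
    return SpeechName(int(m["ordinal"]), m["date"], m["slug"])


# Hypothetical filename, for illustration only.
print(parse_speech_name("042_2017-10-25_example_topic_speech.md"))
# → SpeechName(ordinal=42, date='2017-10-25', slug='example_topic')
```

Returning the ordinal as an integer makes it straightforward to match a speech to its readout, since both share the same global ordinal.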

## 3 · Chengyu Frequency — 成语频率图谱
Frequency atlas of four-character idioms (成语) across the Xi-authored corpus,
with per-chengyu detail pages.

| | |
|---|---|
| **Output** | `CHENGYU_REPORT.html` + `chengyu/<phrase>.html` |
| **Sources** | `sources/books/` *(shared with Historical Allusions)* |
| **Derived** | `corpus/chengyu_hits.jsonl`, `chengyu_sources.{jsonl,csv}`, `chengyu_summary.json`, `chengyu_detail/` |
| **Build** | `build_chengyu_index`, `build_chengyu_detail`, `build_chengyu_report` |

---

## Shared / umbrella
- **Site chrome:** `build/_shell.py` (`top_nav`, `TOP_NAV_CSS`) — used by all output pages.
- **Landing page:** `index.html` ← `build/build_index.py`.
- **Master tracker:** `build/build_master_tracker.py` → `Xi_Corpus_Master_Tracker.xlsx`.

## Layout convention
- `sources/` — raw primary text, one `.md` per item, YAML frontmatter, grouped by provenance.
- `corpus/` — derived indices/aggregations (regenerable; not hand-edited).
- `build/` — pipeline scripts: `scrape_*`/`fetch_*` → `ingest_*` → `build_*`.
- root `*.html` + `chengyu/*.html` — generated site pages.
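
The `scrape_*`/`fetch_*` → `ingest_*` → `build_*` prefix convention lends itself to a simple ordering check. The sketch below groups script names by prefix only; it does not model real dependencies between scripts, and the stage labels (`acquire`, `ingest`, `build`) and function names are hypothetical. Scripts with other prefixes, such as `analyze_readout_vocab` or `extract_goc_allusions`, fall outside the convention and are reported separately.

```python
# Stage order implied by the prefix convention; labels are hypothetical.
STAGES = ("acquire", "ingest", "build")
PREFIX_TO_STAGE = {
    "scrape_": "acquire",
    "fetch_": "acquire",
    "ingest_": "ingest",
    "build_": "build",
}


def stage_of(script):
    """Return the stage label for a script name, or None if no prefix matches."""
    for prefix, stage in PREFIX_TO_STAGE.items():
        if script.startswith(prefix):
            return stage
    return None


def order_by_stage(scripts):
    """Return (scripts sorted by stage, scripts with no recognised prefix).

    sorted() is stable, so the input order is kept within each stage.
    """
    known = [s for s in scripts if stage_of(s) is not None]
    unknown = [s for s in scripts if stage_of(s) is None]
    return sorted(known, key=lambda s: STAGES.index(stage_of(s))), unknown


ordered, unmatched = order_by_stage([
    "build_politburo_dashboard",
    "analyze_readout_vocab",
    "fetch_politburo_readouts",
    "scrape_politburo_study",
    "build_readout_html",
])
print(ordered)
print(unmatched)
```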