# Xi Jinping Historical Allusions Corpus

A dual-purpose library of Xi Jinping's speeches, writings, and curated allusion
compilations — organized so that (a) a human can read any individual speech,
episode, or entry directly, and (b) a machine can aggregate the whole corpus
for quantitative analysis (word counts, entity frequency, dynasty/emperor
tallies, etc.).

## Layout

```
xi-corpus-atlas/
  sources/          # canonical human-readable library (source of truth)
    pingyu/         # 平"语"近人 CCTV documentary transcripts
      s1/           # Season 1 (2018), 12 episodes
      s2/           # Season 2 (2021), 12 episodes
    yongdian_renmin/  # 人民网 theory.people.com.cn 习近平用典 column
    ccdi_pindian/   # 中央纪委国家监委 品典/用典释义 column
    columns_qiushi/ # 求是 用典 column
    speeches/       # individual Xi speeches (集体学习, CCDI plena, etc.)
    books/          # ingested book chapters (Governance of China, 用典 vols, 之江新语, ...)
  corpus/           # DERIVED, machine-readable. Regenerated by build/*.py.
    entries.jsonl          # one line per atomic unit
    allusions.jsonl        # one line per Xi-quote → classical-source pair
    goc_allusions.jsonl    # structured footnote citations from GoC + Jingji
    chengyu_hits.jsonl     # one line per chengyu occurrence (date, volume, bigrams)
    chengyu_summary.json   # aggregates: by_volume + by_year + top_left_bigrams per phrase
    chengyu_sources.csv    # flat pivot-table CSV (phrase × year × article × volume)
    chengyu_detail/        # per-phrase audit .md (chronological hits + substring audit)
    index.csv              # flat spreadsheet view
  chengyu/          # per-phrase HTML detail pages (linked from CHENGYU_REPORT.html)
  seeds/            # curated seed lists (chengyu_hsk.json: 235 + 35 phrases)
  build/            # scripts
  inbox/            # drop-zone for PDFs/epubs; the ingester moves them to sources/books/
  _cache/           # raw HTML cached from scrapes, for provenance/reprocessing

  REPORT.html                # Historical allusions — dynasty tabs + ctext.org links
  CHENGYU_REPORT.html        # Chengyu frequency + by-year sparkline + substring warnings
  POLITBURO_DASHBOARD.html   # Politburo study-session readout analysis
  index.html                 # landing page linking all three
```

## The atomic unit

Everything in `sources/` is a **Markdown file with YAML frontmatter**, with one
file per atomic unit. An atomic unit is one of:

- one 平"语"近人 episode
- one 习近平用典 entry
- one CCDI 品典 entry
- one speech
- one book chapter / article inside a speech compilation

Never group multiple units into one file. Never split one unit across files.

## Frontmatter schema

Every source file has this frontmatter block at the top:

```yaml
---
id: pingyu-s2-e05                    # stable and globally unique
source: pingyu-jinren                # source-family key (see below)
source_tier: 1                       # 1 = authoritative primary, 2 = curated aggregator, 3 = secondary
type: documentary_episode            # documentary_episode | speech | yongdian_entry | column_entry | book_chapter
title_zh: 水则载舟 水则覆舟
title_en: ""                         # empty string if no English version exists
date: 2021-02-25                     # publication/broadcast date of THIS artifact
speech_date: 2014-05-04              # when Xi actually said it, if different; else same as date
speakers: [xi-jinping, scholar-A]    # who speaks in the text. xi-jinping is the one we care about.
url: https://www.12371.cn/...
sha256: 0a1b2c...                    # of the raw HTML/PDF pulled, for provenance
retrieved: 2026-04-14
lang: zh                             # zh | en | zh-en (bilingual in one file)
parallel_id: ""                      # if a sibling file exists in the other language, its id
tags: [governance, people-centered, tang-taizong]
---
```

The body below the frontmatter is the readable text itself. Transcripts use
speaker labels (`**Xi Jinping:**`, `**主持人:**`, `**解读嘉宾 王立群:**`). Speeches
are plain prose. Book chapters preserve section headings.
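
As a minimal sketch of how a script might separate the frontmatter block from the body, the following assumes the file opens with a `---` line and that the block closes at the next `---` line. `split_frontmatter` and `top_level_keys` are hypothetical helpers written for illustration, not functions that exist in `build/`; a real build would likely pass the frontmatter string to a full YAML parser.

```python
import re

def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is '' if the file has none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return frontmatter, body
    return "", text  # no closing delimiter: treat the whole file as body

def top_level_keys(frontmatter: str) -> list[str]:
    """List unindented `key:` names without fully parsing the YAML."""
    return re.findall(r"^([A-Za-z_][A-Za-z0-9_]*):", frontmatter, flags=re.MULTILINE)

sample = """---
id: pingyu-s2-e05
source: pingyu-jinren
lang: zh
---

**Xi Jinping:** ...
"""
fm, body = split_frontmatter(sample)
print(top_level_keys(fm))  # ['id', 'source', 'lang']
print(body)
```

The key listing is only a cheap presence check (for example, to confirm that `id` and `source` are set); it does not validate values or nested structures.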

## Source keys

| source key          | description                                     |
|---------------------|-------------------------------------------------|
| pingyu-jinren       | 平"语"近人 CCTV series                           |
| yongdian-renmin     | 人民网 theory.people.com.cn 习近平用典 column    |
| ccdi-pindian        | CCDI 品典 / 用典释义 column                      |
| qiushi-yongdian     | 求是 用典专题                                    |
| goc-v1 … goc-v5     | 习近平谈治国理政 / Governance of China, vols 1-5 |
| yongdian-book-v1    | 《习近平用典》第一辑 (People's Daily Press 2015) |
| yongdian-book-v2    | 《习近平用典》第二辑 (People's Daily Press 2018) |
| zhijiang-xinyu      | 《之江新语》                                      |
| jingji-gangyao      | 《习近平经济思想学习纲要》                        |
| baituo-pinkun       | 《摆脱贫困》                                      |
| ganzai-shichu       | 《干在实处走在前列》                              |
| lishi-jiaoke        | 《历史是最好的教科书》                            |
| speech-cpbm         | 中央政治局集体学习 speeches                       |
| speech-ccdi         | CCDI 全会 speeches                               |
| speech-other        | other standalone speeches                        |

## ID convention

`{source-key}-{local-id}` (in the examples here, `pingyu-jinren` IDs use the shorter prefix `pingyu`). The local ID depends on the source:

- `pingyu-s1-e01` — season + episode
- `yongdian-renmin-2023-08-15-01` — date + within-day sequence
- `ccdi-pindian-2024-03-12` — date
- `speech-cpbm-2014-10-13` — meeting date
- `goc-v1-ch-007` — volume + chapter number within volume
- `yongdian-book-v1-003` — book + entry number

IDs must be stable forever. If you re-scrape the same entry, the file is
overwritten in place — the ID never changes.
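
The example IDs above suggest a shape for each source family that can be checked mechanically. The sketch below is an assumption-laden illustration: the regular expressions are inferred from the examples only (digit widths included), and `ID_PATTERNS` and `id_family` are hypothetical names, not part of the build scripts.

```python
import re
from typing import Optional

# Patterns inferred from the example IDs; extend as new source families appear.
ID_PATTERNS = {
    "pingyu": r"pingyu-s\d+-e\d{2}",
    "yongdian-renmin": r"yongdian-renmin-\d{4}-\d{2}-\d{2}-\d{2}",
    "ccdi-pindian": r"ccdi-pindian-\d{4}-\d{2}-\d{2}",
    "speech-cpbm": r"speech-cpbm-\d{4}-\d{2}-\d{2}",
    "goc": r"goc-v\d+-ch-\d{3}",
    "yongdian-book": r"yongdian-book-v\d+-\d{3}",
}

def id_family(unit_id: str) -> Optional[str]:
    """Return the family whose pattern matches the whole ID, or None."""
    for family, pattern in ID_PATTERNS.items():
        if re.fullmatch(pattern, unit_id):
            return family
    return None

for candidate in ["pingyu-s1-e01", "goc-v1-ch-007", "goc-v1-ch-7"]:
    print(candidate, "->", id_family(candidate))
```

A check like this could run before files are written, so that a malformed ID is caught before it becomes something other files link to.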

## Tags

A controlled vocabulary, grown as needed. Initial set:

**Themes:** `governance`, `economy`, `party-discipline`, `reform`,
`centralization`, `foreign-policy`, `ethnic-frontier`, `rule-of-law`,
`anti-corruption`, `people-centered`, `cadre-selection`, `legalism`,
`historical-cycle`

**Dynasties (when a specific dynasty is invoked):** `pre-qin`, `qin`, `han`,
`wei-jin-nanbeichao`, `sui`, `tang`, `song`, `yuan`, `ming`, `qing`, `modern`

**Emperors / figures (when named):** `tang-taizong`, `han-wendi`, `han-jingdi`,
`kangxi`, `qianlong`, `qin-shihuang`, `sui-yangdi`, `tang-xuanzong-late`,
`su-shi`, `wang-anshi`, `zhang-juzheng`, `fan-zhongyan`, `wei-zheng`,
`zhu-xi`, `confucius`, `mencius`, `xunzi`, `du-mu`, …

Tags live in the frontmatter of each file; the build script reads them.
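
Because the vocabulary is controlled, misspelled tags can be caught with a simple set difference. This is a sketch only: `KNOWN_TAGS` holds an excerpt of the lists above rather than the full vocabulary, and `unknown_tags` is a hypothetical helper, not an existing build function.

```python
# Excerpt of the controlled vocabulary; a real check would load the full list.
KNOWN_TAGS = {
    # themes
    "governance", "economy", "party-discipline", "reform", "rule-of-law",
    "anti-corruption", "people-centered",
    # dynasties
    "pre-qin", "qin", "han", "sui", "tang", "song", "yuan", "ming", "qing",
    # figures
    "tang-taizong", "wei-zheng", "confucius", "mencius",
}

def unknown_tags(tags, vocabulary=KNOWN_TAGS):
    """Return the tags that are not in the vocabulary, sorted for stable output."""
    return sorted(set(tags) - set(vocabulary))

print(unknown_tags(["governance", "tang", "tang-taizong", "tnag"]))  # ['tnag']
```

An unknown tag is not necessarily an error, since the vocabulary grows as needed; the check only flags candidates for review.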

## Adding new material

1. Drop a PDF/epub into `inbox/`.
2. Run the matching ingester in `build/` (not yet written).
3. The ingester splits the book into atomic units, writes Markdown files with
   frontmatter into `sources/books/{source-key}/`, and moves the original into
   the `sources/books/_originals/` folder.
4. Run `python build/build_corpus.py` to regenerate `corpus/`.
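
Since the ingesters are not yet written, the following is only a sketch of the file-moving part of step 3, under the assumption that originals keep their filename when archived. `archive_original` is a hypothetical helper, and the refusal to overwrite is a design choice made here, not documented behavior.

```python
import shutil
from pathlib import Path

def archive_original(inbox_file: Path, root: Path) -> Path:
    """Move an ingested PDF/epub from inbox/ into sources/books/_originals/."""
    dest_dir = root / "sources" / "books" / "_originals"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / inbox_file.name
    if dest.exists():
        # Keep the first archived copy so provenance is never silently replaced.
        raise FileExistsError(f"{dest} already archived; refusing to overwrite")
    shutil.move(str(inbox_file), str(dest))
    return dest

# Usage, assuming the repository root is the current directory:
# archive_original(Path("inbox/some-book.epub"), Path("."))
```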

## Reading the corpus

**Human:** open `sources/` in Obsidian. Frontmatter becomes queryable via the
Dataview plugin. Side-by-side panes handle parallel zh/en.

**Machine:** read `corpus/entries.jsonl` and `corpus/allusions.jsonl` in a
Jupyter notebook. Everything in `corpus/` is derived — never hand-edit it.
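
As an illustration of the machine-readable path, here is a minimal sketch of a dynasty tally over `corpus/entries.jsonl`. It assumes each line is a JSON object that carries the frontmatter fields, including a `tags` list; the actual schema of `entries.jsonl` is not specified here, so treat the field name as an assumption. `dynasty_tally` is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable

DYNASTY_TAGS = {
    "pre-qin", "qin", "han", "wei-jin-nanbeichao", "sui", "tang",
    "song", "yuan", "ming", "qing", "modern",
}

def dynasty_tally(jsonl_lines: Iterable[str]) -> Counter:
    """Count dynasty tags across entries, one JSON object per line."""
    tally = Counter()
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        # Assumed field: "tags" mirrors the frontmatter tag list.
        tally.update(t for t in entry.get("tags", []) if t in DYNASTY_TAGS)
    return tally

# Usage, assuming the file exists at this path:
# with open("corpus/entries.jsonl", encoding="utf-8") as fh:
#     print(dynasty_tally(fh).most_common())
```

The function takes any iterable of lines, so it works on an open file handle as well as on a small in-memory list when experimenting in a notebook.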

## Derived analyses (all regenerable)

| Report | Generator | Source data | What it answers |
|---|---|---|---|
| `REPORT.html` | `build/build_html_report.py` | `corpus/goc_allusions.jsonl` + `corpus/allusions.jsonl` | Which classical works / figures / dynasties does Xi cite? Organized by figure and by time period, with `ctext ↗` links to the Chinese Text Project. |
| `CHENGYU_REPORT.html` | `build/build_chengyu_report.py` | `corpus/chengyu_summary.json` | Frequency ranking of 235 canonical chengyu + 35 Xi-era policy phrases. Per-phrase year sparklines. Substring-contamination ⚠ badges. |
| `chengyu/<phrase>.html` | `build/build_chengyu_detail.py` | `corpus/chengyu_hits.jsonl` | Every occurrence of a given phrase, chronologically, with article title, quoted context, and a left-bigram audit for detecting compound overreach (e.g. `协调发展` ↔ `区域协调发展`). |
| `corpus/chengyu_sources.csv` | `build/build_chengyu_detail.py` | `corpus/chengyu_hits.jsonl` | Flat CSV for Excel pivots — phrase × year × article × volume. |
| `POLITBURO_DASHBOARD.html` | `build/build_politburo_dashboard.py` | `corpus/politburo_study_sessions.jsonl` + `corpus/readout_vocab_analysis.json` | Politburo collective-study readouts — topic + vocabulary drift. |

Scope & substring caveats are documented in `METHODOLOGY.md`. TL;DR: all
scans use the same Xi-authored corpus (editorial compilations excluded),
YAML frontmatter is skipped in substring scans, and per-hit JSONL keeps
everything auditable.
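
The left-bigram audit mentioned in the table can be sketched as follows. This is an illustrative reimplementation, not the code in `build/build_chengyu_detail.py`: it counts the two characters immediately to the left of every occurrence of a phrase, so that a dominant left bigram such as `区域` signals that many hits belong to the longer compound `区域协调发展`. The function name and the sample sentence are made up for the example.

```python
from collections import Counter

def left_bigrams(text: str, phrase: str) -> Counter:
    """Count the two characters immediately preceding each occurrence of phrase."""
    counts = Counter()
    start = text.find(phrase)
    while start != -1:
        counts[text[max(0, start - 2):start]] += 1
        start = text.find(phrase, start + 1)
    return counts

text = "推动区域协调发展。坚持协调发展理念，促进区域协调发展。"
print(left_bigrams(text, "协调发展"))
# Counter({'区域': 2, '坚持': 1})
```

In this sample, two of the three hits are preceded by `区域`, which is the kind of pattern a substring-contamination warning is meant to surface.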
