Methodology (方法说明)

The Xi Corpus Atlas is a set of quantitative, source-traceable studies of Xi Jinping–era primary texts. This page explains what the Atlas is, what it counts as a source, how each measure is computed, and — just as important — how to read the results with appropriate caution.

What this is

Four independent projects share one principle: every chart, count, and table is regenerated from a library of individual source documents, and every number can be traced back to the exact passage that produced it. Nothing is hand-entered; the figures fall out of an open, repeatable pipeline.

Historical Allusions (经典用典图谱)

Which classical works, figures, and dynasties Xi cites — reconstructed from the footnoted citations in his published works and the official allusion atlases.

Politburo Study (政治局集体学习)

All 107 collective-study sessions of the Hu and Xi eras: topic, lecturer, institution, the official readout, and — where later published — the full speech, with language-change analysis across the three Central Committee terms.

Chengyu Frequency (成语频率图谱)

How often four-character idioms and signature Xi-era formulations recur across Xi's own books, by year and volume — every occurrence auditable to its line.

Commissions & Leading Groups (中央委员会与领导小组)

The Party's deliberative-coordinating bodies: an org chart of the bodies Xi personally chairs, plus per-body meeting histories (every session, date, readout, and speech).

Source principle: Xi's own words

The corpus is built to measure Xi's own language and choices, not commentary about them. Sources are organized in tiers, and editorial or secondary material is kept out of the word-level counts:

Tier 1 — Xi's continuous prose. His published books and speech compilations (The Governance of China vols 1–5, the economic anthology, 之江新语, 摆脱贫困, and others). This is the backbone of the frequency scans.
Tier 2 — annotated allusion atlases. The official 习近平用典 compilations, which pair a Xi quotation with its classical source and an editor's gloss. Used for the allusion mapping; the editors' commentary is not counted as Xi's words.
Tier 3 — meeting readouts. The People's Daily / Xinhua summaries of Politburo study sessions and of commission meetings. These are third-person paraphrase, not verbatim Xi — see the next section.
Tier 4 — reception. Documentary transcripts and explanatory columns about Xi's citations (e.g. CCTV 平“语”近人, CCDI 品典). Catalogued for context but kept out of the Xi-authored word counts.

Readout vs. speech: two texts we keep separate

For a single meeting there can be two very different texts:

the readout (通稿) — a third-person summary released the same day ("Xi stressed that …"); and
the speech (讲话全文) — Xi's own first-person address, often published months later (e.g. in Qiushi).

These are different registers, so the Atlas never merges them. Where both exist they can be opened side by side, and any vocabulary measured from readouts is labelled as the summarizers' wording, not Xi's verbatim words. One wire-service trap is worth naming: an announcement of a forthcoming article ("…文章指出…文章强调…") reads like the speech but is not; the pipeline detects and rejects those, keeping only the full first-person text.
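The announcement check described above can be sketched as a simple marker count. This is only an illustration: the Atlas's actual detection rule is not documented here, so the function name `looks_like_announcement`, the `min_hits` threshold, and the two sample strings are all assumptions. Only the two markers quoted in the text (文章指出, 文章强调) are used.

```python
import re

# Third-person markers typical of a wire announcement about an article
# ("the article points out", "the article stresses"). Hypothetical rule set:
# only the two markers quoted in the text are included.
ANNOUNCEMENT_MARKERS = re.compile(r"文章(?:指出|强调)")

def looks_like_announcement(text: str, min_hits: int = 2) -> bool:
    """Return True when the text repeatedly refers to 'the article' in the third person."""
    return len(ANNOUNCEMENT_MARKERS.findall(text)) >= min_hits

# Invented sample strings, for illustration only.
announcement = "文章指出，要坚持改革。文章强调，要加强学习。"
speech = "今天，我们进行集体学习。我讲几点意见。"

print(looks_like_announcement(announcement))  # True
print(looks_like_announcement(speech))        # False
```

A text flagged this way would be set aside rather than stored as the speech; a first-person address contains no such markers and passes through.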

How allusions are measured

Classical allusions are reconstructed from the footnoted citations in Xi's published works and the allusion atlases — not from fuzzy text matching. Each citation is resolved to its classical source; where the source is a named work or figure it is classified by dynasty or period and linked out to the Chinese Text Project (ctext.org) so the original can be read in context. The Atlas shows the evidence inline; it does not host the full text of the books it draws from.

How idiom & phrase frequency is measured

Idiom and signature-phrase counts are exact-substring matches over the Tier-1 Xi-authored corpus only. Two rules keep the numbers honest:

Frontmatter is excluded — tags and metadata at the top of each source file never count as usage.
Substring contamination is controlled — a short idiom can be swallowed by a longer phrase (e.g. 协调发展 inside 区域协调发展). Counting uses a non-overlapping, longest-match pass so each stretch of text is credited to exactly one phrase, and affected cases are flagged rather than silently double-counted.

Every occurrence is kept as one auditable row (date, volume, surrounding text), so any frequency can be drilled down to its individual hits.
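The two rules above can be sketched as follows. This is a minimal illustration, not the Atlas's own code: it assumes frontmatter is a leading block fenced by `---` lines, and it implements "non-overlapping, longest-match" as a greedy left-to-right scan. The function names and the sample document are hypothetical.

```python
from collections import Counter

def strip_frontmatter(text: str) -> str:
    """Drop a leading block fenced by '---' lines (assumed frontmatter format)."""
    lines = text.split("\n")
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:])
    return text

def count_longest_match(text: str, phrases: list[str]) -> Counter:
    """At each position, credit the longest phrase starting there, then skip past it."""
    by_length = sorted(set(phrases), key=len, reverse=True)
    counts: Counter = Counter()
    i = 0
    while i < len(text):
        for phrase in by_length:
            if text.startswith(phrase, i):
                counts[phrase] += 1
                i += len(phrase)  # consume the match so nothing overlaps it
                break
        else:
            i += 1  # no phrase starts here; advance one character
    return counts

# Invented sample: the tag in the frontmatter must not count as usage.
doc = "---\ntags: 协调发展\n---\n促进区域协调发展，坚持协调发展。"
body = strip_frontmatter(doc)
counts = count_longest_match(body, ["协调发展", "区域协调发展"])
print(dict(counts))  # {'区域协调发展': 1, '协调发展': 1}
```

A naive `body.count("协调发展")` would return 2 here, because it also counts the occurrence swallowed by 区域协调发展. The greedy scan is one reasonable reading of the rule; when a shorter phrase starts earlier than an overlapping longer one, it credits the earlier phrase.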

Vocabulary shifts across the three terms

For the Politburo project the Atlas tracks how the language of the curriculum changes across the 18th, 19th, and 20th Central Committees. Three methods work together:

Curated thematic phrase buckets. ~230 signature phrases are grouped into the same eleven themes used elsewhere on the site. Every phrase in every bucket is disclosed on the page, each is verified to actually occur in the corpus, and counts use the non-overlapping longest-match rule above. Buckets are deliberately unequal in size, so the page's guidance is "read the slope, not the rank": a line's rise or fall over time is the reliable signal, not its absolute height relative to other themes.
Rates per 10,000 characters. Counts are normalized by text volume so that Central Committee terms with different amounts of text compare fairly.
Weighted log-odds (the "Fightin' Words" method). Phrasing distinctive of one Central Committee term, or of the speech register vs. the readout register, is surfaced with an informative-prior log-odds model that promotes genuinely characteristic language and suppresses boilerplate.
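The last two methods can be sketched in a few lines. The log-odds function follows the standard informative-Dirichlet-prior formulation of the "Fightin' Words" approach; using the pooled counts of both groups as the prior is an assumption made for this illustration, as are the function names and the toy counts. The Atlas's actual prior and vocabulary are not specified here.

```python
import math
from collections import Counter

def rate_per_10k(count: int, total_chars: int) -> float:
    """Occurrences per 10,000 characters of text."""
    return 10_000 * count / total_chars

def weighted_log_odds(counts_a: Counter, counts_b: Counter, prior: Counter) -> dict[str, float]:
    """z-scored log-odds ratio of each phrase in group A vs. group B, smoothed by a prior."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        y_a, y_b = counts_a[w], counts_b[w]
        log_odds_a = math.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + a0 - y_b - a_w))
        variance = 1 / (y_a + a_w) + 1 / (y_b + a_w)  # approximate variance of the difference
        z[w] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z

# Toy counts, invented for illustration.
term_a = Counter({"改革": 30, "安全": 5, "发展": 40})
term_b = Counter({"改革": 10, "安全": 25, "发展": 40})
prior = term_a + term_b  # assumption: pooled counts serve as the prior

scores = weighted_log_odds(term_a, term_b, prior)
print(rate_per_10k(12, 48_000))  # 2.5
print(sorted(scores, key=scores.get, reverse=True))  # ['改革', '发展', '安全']
```

A phrase equally common in both groups (发展 here) scores zero however frequent it is, which is how the method pushes boilerplate down while surfacing phrases characteristic of one side.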

Sensitivity analysis — separating signal from artifact

The published-speech corpus is small and editorially selected (only some sessions get a speech published), so a few long documents can dominate the totals for a Central Committee term. To make that visible rather than misleading, each trajectory can be viewed under three normalizations:

Per 10K characters — pools a term's text (weights by document length);
Per document — averages each document's own rate (one vote each, so a single long speech can't dominate);
Document frequency — in how many documents a phrase appears at all.

Where the views agree, the trend is robust; where they diverge, the number is driven by a few documents, and the page flags single-document spikes outright. This is why, for example, an apparent early-term spike in ideological vocabulary shrinks once each speech is weighted equally: the signal is real, but smaller than the pooled rate suggests.
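A small worked example shows how the three views can diverge. The function name `three_views` and the numbers are invented for illustration; each document is reduced to a pair of (phrase count, length in characters).

```python
def three_views(docs: list[tuple[int, int]]) -> dict[str, float]:
    """docs holds one (phrase_count, char_length) pair per document."""
    total_hits = sum(count for count, _ in docs)
    total_chars = sum(length for _, length in docs)
    per_doc_rates = [10_000 * count / length for count, length in docs]
    return {
        # Pooled: long documents carry more weight.
        "per_10k_pooled": 10_000 * total_hits / total_chars,
        # Per document: every document gets one vote.
        "per_document_mean": sum(per_doc_rates) / len(docs),
        # Document frequency: how many documents use the phrase at all.
        "document_frequency": sum(1 for count, _ in docs if count > 0),
    }

# One long speech with 40 hits, two with none, one with a single hit.
views = three_views([(40, 20_000), (0, 5_000), (0, 5_000), (1, 10_000)])
print(views)
# {'per_10k_pooled': 10.25, 'per_document_mean': 5.25, 'document_frequency': 2}
```

Here the pooled rate is roughly double the per-document mean, and only two of four documents contain the phrase: the pattern of a single long speech driving the number.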

Theme classification

Each study session is assigned a theme by a scored, multi-label classifier, not a first-keyword-wins rule. Every theme is scored by how much of its keyword evidence is present (weighting longer, more specific phrases over short generic ones); the top theme is the primary label and a clear runner-up is kept as a secondary label, so genuinely cross-cutting sessions carry both. A small, fully documented override file corrects the handful of cases the score gets wrong; each override records its reason, so the classification stays auditable.
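One way such a classifier could work is sketched below. Everything specific in it is an assumption: the score (share of a theme's keyword characters found in the text, so longer phrases weigh more), the `min_ratio` cut-off for a "clear runner-up", the theme lists, and the override format are hypothetical stand-ins for the Atlas's own rules.

```python
from typing import Optional

def classify(text: str, themes: dict[str, list[str]],
             min_ratio: float = 0.4) -> tuple[Optional[str], Optional[str]]:
    """Score every theme, then return (primary, secondary or None)."""
    scores = {}
    for theme, keywords in themes.items():
        total = sum(len(k) for k in keywords)
        present = sum(len(k) for k in keywords if k in text)  # length-weighted evidence
        scores[theme] = present / total if total else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    primary = ranked[0] if ranked and scores[ranked[0]] > 0 else None
    secondary = None
    if primary and len(ranked) > 1:
        runner_up = ranked[1]
        if scores[runner_up] > 0 and scores[runner_up] >= min_ratio * scores[primary]:
            secondary = runner_up
    return primary, secondary

def classify_session(session_id, text, themes, overrides):
    """Documented overrides win over the score."""
    if session_id in overrides:
        entry = overrides[session_id]
        return entry["primary"], entry.get("secondary")
    return classify(text, themes)

# Hypothetical themes, override entry, and sample text.
THEMES = {
    "economy": ["经济", "高质量发展", "供给侧结构性改革"],
    "technology": ["科技", "人工智能", "科技创新"],
    "law": ["法治", "依法治国"],
}
OVERRIDES = {"s-002": {"primary": "law", "secondary": None,
                       "reason": "illustrative entry; the reason is always recorded"}}
text = "以科技创新推动高质量发展，加快人工智能与实体经济融合。"

labels = classify_session("s-001", text, THEMES, OVERRIDES)
print(labels)  # ('technology', 'economy')
```

Because every theme is scored before a label is chosen, the order of keywords or themes does not decide the outcome, which is the main difference from a first-keyword-wins rule.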

Commissions & leading groups

The newest project charts the CCP Central Committee's deliberative-and-coordinating bodies (中央议事协调机构). The overview shows which bodies Xi personally chairs; clicking a body opens its full meeting history. Three conventions keep it honest:

Confidence is labelled. Each body is marked verified (chair cited), reported (widely reported, citation pending), or to confirm (the current head is not officially published — shown as such rather than guessed).
Missing readouts are flagged, not hidden. Where a meeting was held without a public readout, or where the official numbering implies a meeting that never surfaced in public indexes, the gap is marked explicitly — a recurring and analytically interesting feature of how these bodies disclose their work.
Active vs. dormant is shown. Each built-out body carries a current-status note, so a body that has stopped meeting publicly is distinguished from one still convening.
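The first two conventions lend themselves to a short sketch. It assumes that meetings are numbered consecutively from 1 within the period being checked, which may not hold for every body; the `Confidence` enum and the `numbering_gaps` helper are hypothetical names, not the Atlas's own data model.

```python
from enum import Enum

class Confidence(Enum):
    """The three confidence labels described above."""
    VERIFIED = "verified"      # chair cited
    REPORTED = "reported"      # widely reported, citation pending
    TO_CONFIRM = "to confirm"  # current head not officially published

def numbering_gaps(known_numbers: list[int]) -> list[int]:
    """Meeting numbers implied by the official numbering but missing from the public record."""
    if not known_numbers:
        return []
    seen = set(known_numbers)
    return [n for n in range(1, max(seen) + 1) if n not in seen]

# Invented example: readouts exist for meetings 1, 2, 4, 5 and 7.
gaps = numbering_gaps([1, 2, 4, 5, 7])
print(gaps)  # [3, 6]
```

Meetings found this way would be shown as explicit gaps rather than omitted, matching the "flagged, not hidden" convention.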

Sourcing, provenance & verification

Primary sources are official: People's Daily / Xinhua, Qiushi, the CCDI and People's Daily theory columns, CCTV, gov.cn, the State Commission Office for Public Sector Reform, and Xi's published volumes. Each session, speech, and meeting links out to its original source. The standard for any external claim is simple: cite a real, checkable URL, or mark it unverified — never fabricate a quotation, a date, or an attribution. Where a fact is widely reported but not yet pinned to a primary citation, it is labelled as such.

Reproducibility & auditability

The library of source documents is the single source of truth; all derived data and every page on this site are regenerated from it by open scripts. Per-item records — one row per allusion, per idiom occurrence, per session, per meeting — let any figure be checked against the underlying passage.

Caveats & limitations

Readouts are paraphrase. Vocabulary drawn from session or meeting readouts is the state-media summarizers' wording, not Xi's verbatim language.
Frequency ≠ emphasis. A high count reflects recurrence in published text, which is shaped by editing and genre as much as by intent.
Small samples & selection. The published-speech corpus is partial and non-random; treat term-to-term levels there as indicative, and prefer the sensitivity views.
Bucket sizes are unequal. Thematic-trajectory heights partly reflect how many phrases a bucket contains — read the slope, not the rank.
Coverage is uneven and evolving. Some sessions never receive a published speech; some meeting indexes contain documented gaps or dates still being confirmed. These are marked on the page.
Translation is a gloss. English labels are convenience translations; the Chinese is authoritative.