Methodology (方法说明)

The Xi Corpus Atlas is a set of quantitative, source-traceable studies of Xi Jinping–era primary texts. This page explains what the Atlas is, what it counts as a source, how each measure is computed, and — just as important — how to read the results with appropriate caution.

What this is

Four independent projects share one principle: every chart, count, and table is regenerated from a library of individual source documents, and every number can be traced back to the exact passage that produced it. Nothing is hand-entered; the figures fall out of an open, repeatable pipeline.

Historical Allusions (经典用典图谱)

Which classical works, figures, and dynasties Xi cites — reconstructed from the footnoted citations in his published works and the official allusion atlases.

Politburo Study (政治局集体学习)

All 107 collective-study sessions of the Hu and Xi eras: topic, lecturer, institution, the official readout, and — where later published — the full speech, with language-change analysis across the three Central Committee terms.

Chengyu Frequency (成语频率图谱)

How often four-character idioms and signature Xi-era formulations recur across Xi's own books, by year and volume — every occurrence auditable to its line.

Commissions & Leading Groups (中央委员会与领导小组)

The Party's deliberative-and-coordinating bodies: an org chart of which bodies Xi personally chairs, plus per-body meeting histories (every session, date, readout, and speech).

Source principle: Xi's own words

The corpus is built to measure Xi's own language and choices, not commentary about them. Sources are organized in tiers, and editorial or secondary material is kept out of the word-level counts:

Readout vs. speech: two texts we keep separate

For a single meeting there can be two very different texts:

These are different registers, so the Atlas never merges them. Where both exist they can be opened side by side, and any vocabulary measured from readouts is labelled as the summarizers' wording, not Xi's verbatim words. One wire-service trap is worth naming: an announcement of a forthcoming article ("…文章指出…文章强调…") reads like the speech but is not; the pipeline detects and rejects those, keeping only the full first-person text.
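The announcement check can be pictured with a small sketch. The Atlas's actual detection rule is not described on this page, so everything here is an assumption: the marker list, the `min_markers` threshold, and the function names are hypothetical, and the sample strings in the usage note are synthetic, not quotations.

```python
# Third-person attribution markers typical of an article announcement.
# Assumption: these two markers, taken from the example above, are enough
# for illustration; a real pipeline would likely use a longer list.
ANNOUNCEMENT_MARKERS = ("文章指出", "文章强调")


def count_markers(text, markers=ANNOUNCEMENT_MARKERS):
    """Count how many times any announcement marker occurs in the text."""
    return sum(text.count(m) for m in markers)


def looks_like_announcement(text, min_markers=2):
    """Heuristic: repeated 'the article points out / stresses' phrasing
    suggests a wire summary of an article, not the first-person speech."""
    return count_markers(text) >= min_markers
```

For instance, a synthetic summary such as "文章指出,要坚持改革。文章强调,要狠抓落实。" would be rejected, while a first-person opening such as "同志们,今天我们进行集体学习。" would pass through. A threshold of two is a guess meant to avoid rejecting a speech that happens to quote an article once.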

How allusions are measured

Classical allusions are reconstructed from the footnoted citations in Xi's published works and the allusion atlases — not from fuzzy text matching. Each citation is resolved to its classical source; where the source is a named work or figure it is classified by dynasty or period and linked out to the Chinese Text Project (ctext.org) so the original can be read in context. The Atlas shows the evidence inline; it does not host the full text of the books it draws from.

How idiom & phrase frequency is measured

Idiom and signature-phrase counts are exact-substring matches over the Tier-1 Xi-authored corpus only. Two rules keep the numbers honest:

Every occurrence is kept as one auditable row (date, volume, surrounding text), so any frequency can be drilled down to its individual hits.
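As a rough illustration of exact-substring counting with one auditable row per hit, a minimal sketch might look like the following. The `Hit` record, its field names, the context window, and the choice to count non-overlapping matches are all assumptions made for the example, not the Atlas's documented schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One auditable occurrence (hypothetical row layout)."""
    phrase: str
    doc_id: str
    date: str      # ISO date string, e.g. "2014-05-04"
    volume: str
    offset: int    # character offset of the match in the document
    context: str   # surrounding text for manual checking


def find_hits(phrase, doc_id, date, volume, text, window=10):
    """Return one Hit per non-overlapping exact occurrence of phrase."""
    hits = []
    start = 0
    while True:
        i = text.find(phrase, start)
        if i == -1:
            break
        lo = max(0, i - window)
        hi = min(len(text), i + len(phrase) + window)
        hits.append(Hit(phrase, doc_id, date, volume, i, text[lo:hi]))
        start = i + len(phrase)  # assumption: matches do not overlap
    return hits


def frequency_by_year(hits):
    """Aggregate hits into per-year counts; each count stays traceable to its rows."""
    counts = {}
    for h in hits:
        year = h.date[:4]
        counts[year] = counts.get(year, 0) + 1
    return counts
```

Because the per-year counts are derived from the rows rather than stored separately, any total can be re-derived by filtering the rows, which is the drill-down property described above.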

Vocabulary shifts across the three terms

For the Politburo project the Atlas tracks how the language of the curriculum changes across the 18th, 19th, and 20th Central Committees. Three methods work together:

Sensitivity analysis — separating signal from artifact

The published-speech corpus is small and editorially selected (only some sessions get a speech published), so a few long documents can dominate a term. To make that visible rather than misleading, each trajectory can be viewed under three normalizations:

Where the views agree, the trend is robust; where they diverge, the number is driven by a few documents, and the page flags single-document spikes outright. This is why, for example, an apparent early-term spike in ideological vocabulary shrinks once each speech is weighted equally: the signal is real, but smaller than the raw counts suggest.
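The effect of equal weighting can be shown with a small sketch. It compares two plausible views, a pooled rate and a per-document mean, and computes the share of hits coming from the single largest document. These definitions, the function names, and the toy numbers are assumptions for illustration; they are not claimed to be the Atlas's actual normalizations.

```python
def pooled_rate(docs, per=1000):
    """Hits per `per` characters with all documents pooled together.
    docs is a list of (hits, length_in_characters) pairs."""
    total_hits = sum(h for h, _ in docs)
    total_len = sum(n for _, n in docs)
    return per * total_hits / total_len if total_len else 0.0


def per_doc_mean_rate(docs, per=1000):
    """Mean of per-document rates, so every speech counts equally."""
    rates = [per * h / n for h, n in docs if n > 0]
    return sum(rates) / len(rates) if rates else 0.0


def top_doc_share(docs):
    """Fraction of all hits contributed by the single largest document."""
    total = sum(h for h, _ in docs)
    return max(h for h, _ in docs) / total if total else 0.0


# Toy term: one long speech dense in the phrase, two short sparse ones.
term_docs = [(40, 10000), (1, 2000), (1, 2000)]
print(pooled_rate(term_docs))        # pooled view, dominated by the long speech
print(per_doc_mean_rate(term_docs))  # equal-weight view, noticeably lower
print(top_doc_share(term_docs))      # high share suggests a single-document spike
```

In this toy case the pooled rate is roughly twice the equal-weight rate and one document supplies nearly all the hits, which is the kind of divergence that would be flagged.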

Theme classification

Each study session is assigned a theme by a scored, multi-label classifier, not a first-keyword-wins rule. Every theme is scored by how much of its keyword evidence is present (weighting longer, more specific phrases over short generic ones); the top theme is primary and a clear runner-up is kept as a secondary label, so genuinely cross-cutting sessions carry both labels. A small, fully documented override file corrects the handful of cases the score gets wrong; each override records the reason, so the classification is auditable.
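A scored classifier of this kind could be sketched as follows. The length-based weighting, the `runner_up_ratio` threshold for what counts as a "clear runner-up", the override format, and the sample keyword lists are all hypothetical choices made for the example; the page does not specify the real ones.

```python
def score_themes(text, theme_keywords):
    """Score each theme as the length-weighted share of its keywords found.
    Assumption: a keyword's weight is its character length, so longer,
    more specific phrases count for more than short generic ones."""
    scores = {}
    for theme, keywords in theme_keywords.items():
        total = sum(len(k) for k in keywords)
        found = sum(len(k) for k in keywords if k in text)
        scores[theme] = found / total if total else 0.0
    return scores


def classify(session_id, text, theme_keywords, overrides=None, runner_up_ratio=0.6):
    """Return primary and optional secondary theme, honoring documented overrides."""
    overrides = overrides or {}
    if session_id in overrides:
        o = overrides[session_id]
        # Each override must carry a reason so the correction stays auditable.
        return {"primary": o["primary"], "secondary": o.get("secondary"),
                "reason": o["reason"]}
    scores = score_themes(text, theme_keywords)
    ranked = sorted(((t, s) for t, s in scores.items() if s > 0),
                    key=lambda ts: ts[1], reverse=True)
    if not ranked:
        return {"primary": None, "secondary": None, "reason": "no keyword evidence"}
    primary, top = ranked[0]
    secondary = None
    # Keep the runner-up only if its score is close enough to the top score.
    if len(ranked) > 1 and ranked[1][1] >= runner_up_ratio * top:
        secondary = ranked[1][0]
    return {"primary": primary, "secondary": secondary, "reason": "scored"}
```

With illustrative keyword lists such as `{"law": ["依法治国", "宪法"], "economy": ["供给侧结构性改革", "经济"]}`, a text mentioning both 依法治国 and 供给侧结构性改革 would receive both labels, while a text that only touches 经济 in passing would keep a single primary theme.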

Commissions & leading groups

The newest project charts the CCP Central Committee's deliberative-and-coordinating bodies (中央议事协调机构). The overview delineates which bodies Xi personally chairs; clicking a body opens its full meeting history. Three conventions keep it honest:

Sourcing, provenance & verification

Primary sources are official: People's Daily / Xinhua, Qiushi, the CCDI and People's Daily theory columns, CCTV, gov.cn, the State Commission Office for Public Sector Reform, and Xi's published volumes. Each session, speech, and meeting links out to its original source. The standard for any external claim is simple: cite a real, checkable URL, or mark it unverified — never fabricate a quotation, a date, or an attribution. Where a fact is widely reported but not yet pinned to a primary citation, it is labelled as such.

Reproducibility & auditability

The library of source documents is the single source of truth; all derived data and every page on this site are regenerated from it by open scripts. Per-item records — one row per allusion, per idiom occurrence, per session, per meeting — let any figure be checked against the underlying passage.

Caveats & limitations