The Xi Corpus Atlas is a set of quantitative, source-traceable studies of Xi Jinping–era primary texts. This page explains what the Atlas is, what it counts as a source, how each measure is computed, and — just as important — how to read the results with appropriate caution.
Four independent projects share one principle: every chart, count, and table is regenerated from a library of individual source documents, and every number can be traced back to the exact passage that produced it. Nothing is hand-entered; the figures fall out of an open, repeatable pipeline.
Which classical works, figures, and dynasties Xi cites — reconstructed from the footnoted citations in his published works and the official allusion atlases.
All 107 collective-study sessions of the Hu and Xi eras: topic, lecturer, institution, the official readout, and — where later published — the full speech, with language-change analysis across the three Central Committee terms.
How often four-character idioms and signature Xi-era formulations recur across Xi's own books, by year and volume — every occurrence auditable to its line.
The Party's deliberative-coordinating bodies: an org chart of who Xi personally chairs, plus per-body meeting histories (every session, date, readout, and speech).
The corpus is built to measure Xi's own language and choices, not commentary about them. Sources are organized in tiers, and editorial or secondary material is kept out of the word-level counts:
For a single meeting there can be two very different texts:
These are different registers, so the Atlas never merges them. Where both exist they can be opened side by side, and any vocabulary measured from readouts is labelled as the summarizers' wording, not Xi's verbatim words. One wire-service trap is worth naming: an announcement of a forthcoming article ("…文章指出…文章强调…") reads like the speech but is not; the pipeline detects and rejects those, keeping only the full first-person text.
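The announcement check described above can be sketched as a simple marker count. This is a minimal illustration, not the Atlas's actual detector: the function name `looks_like_announcement` and the `min_hits` threshold are assumptions, and the only markers used are the two reporting phrases quoted in the text.

```python
# Reporting phrases quoted above; a first-person speech would not
# normally describe itself as "the article points out ...".
ANNOUNCEMENT_MARKERS = ("文章指出", "文章强调")


def looks_like_announcement(text, min_hits=2):
    """Return True when the text contains enough third-person reporting
    markers to be treated as an announcement rather than the speech itself.

    min_hits is a hypothetical threshold chosen for this sketch.
    """
    hits = sum(text.count(marker) for marker in ANNOUNCEMENT_MARKERS)
    return hits >= min_hits


# Synthetic placeholder strings, not real quotations.
announcement = "文章指出,甲。文章强调,乙。"
speech = "同志们:今天我们进行集体学习。"

kept = [t for t in (announcement, speech) if not looks_like_announcement(t)]
```

A real filter would probably combine this with other signals (length, byline, first-person forms); the sketch only shows the rejection step named in the text.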
Classical allusions are reconstructed from the footnoted citations in Xi's
published works and the allusion atlases — not from fuzzy text matching. Each citation is
resolved to its classical source; where the source is a named work or figure it is
classified by dynasty or period and linked out to the Chinese Text Project
(ctext.org) so the original can be read in context. The Atlas shows the
evidence inline; it does not host the full text of the books it draws from.
Idiom and signature-phrase counts are exact-substring matches over the Tier-1 Xi-authored corpus only. Two rules keep the numbers honest:
A shorter phrase can sit inside a longer one (e.g. 协调发展 inside 区域协调发展). Counting uses a
non-overlapping, longest-match pass so each stretch of text is credited to exactly
one phrase, and affected cases are flagged rather than silently double-counted.
Every occurrence is kept as one auditable row (date, volume, surrounding text), so any frequency can be drilled down to its individual hits.
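A non-overlapping, longest-match pass of the kind described above might look like the following. It is a sketch under assumptions: the function name `count_phrases`, the left-to-right scan order, and the shape of the returned "nested" flags are illustrative choices, not the Atlas's published code.

```python
from collections import Counter


def count_phrases(text, phrases):
    """Count watched phrases with a leftmost, longest-match, non-overlapping scan.

    Returns (counts, nested): counts credits each stretch of text to exactly
    one phrase; nested records how often a shorter watched phrase was found
    inside a longer match, so those cases can be flagged for review.
    """
    # Try longer phrases first so the longest match at a position wins.
    ordered = sorted({p for p in phrases if p}, key=len, reverse=True)
    counts, nested = Counter(), Counter()
    i = 0
    while i < len(text):
        match = next((p for p in ordered if text.startswith(p, i)), None)
        if match is None:
            i += 1
            continue
        counts[match] += 1
        # Flag shorter watched phrases contained in the matched span.
        for p in ordered:
            if p != match and p in match:
                nested[p] += 1
        i += len(match)  # skip past the match so nothing is counted twice
    return counts, nested


counts, nested = count_phrases(
    "推动区域协调发展,坚持协调发展",
    ["协调发展", "区域协调发展"],
)
```

In this example the first stretch is credited only to 区域协调发展, the standalone 协调发展 later in the string is counted once, and the nested case is recorded separately rather than added to the total.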
For the Politburo project the Atlas tracks how the language of the curriculum changes across the 18th, 19th, and 20th Central Committees. Three methods work together:
The published-speech corpus is small and editorially selected (only some sessions get a speech published), so a few long documents can dominate a term. To make that visible rather than misleading, each trajectory can be viewed under three normalizations:
Where the views agree, the trend is robust; where they diverge, the number is driven by a few documents, and the page flags single-document spikes outright. This is why, for example, an apparent early-term spike in ideological vocabulary shrinks once each speech is weighted equally: the signal is real, but smaller than the raw counts suggest.
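To illustrate why views can diverge, the sketch below compares a pooled rate (all text in a term treated as one block) with an equal-weight rate (each speech counted once), and reports how much of the total comes from a single document. These two views and the function names are assumptions made for illustration; they are not necessarily the normalizations the Atlas uses.

```python
from statistics import mean


def pooled_rate(docs, term, per=10_000):
    """Hits per `per` characters with all documents pooled together."""
    total_hits = sum(d.count(term) for d in docs)
    total_len = sum(len(d) for d in docs)
    return per * total_hits / total_len if total_len else 0.0


def equal_weight_rate(docs, term, per=10_000):
    """Mean of per-document rates, so a long speech counts no more than a short one."""
    rates = [per * d.count(term) / len(d) for d in docs if d]
    return mean(rates) if rates else 0.0


def top_document_share(docs, term):
    """Fraction of all hits contributed by the single heaviest document."""
    hits = [d.count(term) for d in docs]
    total = sum(hits)
    return max(hits) / total if total else 0.0


# Synthetic term: one long document carries every hit.
docs = ["甲" * 20 + "乙" * 980, "乙" * 100, "乙" * 100]
pooled = pooled_rate(docs, "甲")
equal = equal_weight_rate(docs, "甲")
share = top_document_share(docs, "甲")
```

Here the pooled rate is well above the equal-weight rate and the top-document share is 1.0, which is the pattern a single-document spike would produce.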
Each study session is assigned a theme by a scored, multi-label classifier, not a first-keyword-wins rule. Every theme is scored by how much of its keyword evidence is present (weighting longer, more specific phrases over short generic ones); the top theme is primary and a clear runner-up is kept as a secondary, so genuinely cross-cutting sessions carry both labels. A small, fully documented override file corrects the handful of cases the score gets wrong — each override records the reason, so the classification is auditable.
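A scored, multi-label assignment with documented overrides could be sketched as follows. Everything specific here is an assumption for illustration: weighting each keyword by its character length, the `runner_up_ratio` cutoff for keeping a secondary theme, the override record layout, and the sample themes and keywords.

```python
def score_themes(text, theme_keywords):
    """Score each theme by the share of its keyword weight found in the text.

    Each keyword is weighted by its length, so longer, more specific phrases
    count for more than short generic ones.
    """
    scores = {}
    for theme, keywords in theme_keywords.items():
        total = sum(len(k) for k in keywords)
        present = sum(len(k) for k in keywords if k in text)
        scores[theme] = present / total if total else 0.0
    return scores


def classify(session_id, text, theme_keywords, overrides=None, runner_up_ratio=0.6):
    """Return primary/secondary themes; a documented override takes precedence."""
    overrides = overrides or {}
    if session_id in overrides:
        o = overrides[session_id]
        return {"primary": o["primary"], "secondary": o.get("secondary"),
                "reason": o["reason"]}
    ranked = sorted(score_themes(text, theme_keywords).items(),
                    key=lambda kv: kv[1], reverse=True)
    ranked = [(t, s) for t, s in ranked if s > 0]
    if not ranked:
        return {"primary": None, "secondary": None, "reason": "no keyword evidence"}
    primary, top = ranked[0]
    # Keep the runner-up only when its score is close to the top score.
    secondary = None
    if len(ranked) > 1 and ranked[1][1] >= runner_up_ratio * top:
        secondary = ranked[1][0]
    return {"primary": primary, "secondary": secondary, "reason": "scored"}


# Hypothetical themes and keywords.
themes = {"economy": ["供给侧结构性改革", "经济"], "law": ["依法治国", "法治"]}
text = "推进供给侧结构性改革,坚持依法治国,建设法治政府"
scored = classify("s1", text, themes)
overridden = classify("s1", text, themes,
                      overrides={"s1": {"primary": "economy",
                                        "reason": "manual review"}})
```

Storing the reason alongside each override is what keeps a corrected label auditable: a reader can tell a scored result from a manual one.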
The newest project charts the CCP Central Committee's deliberative-coordinating bodies (中央议事协调机构). The overview shows which bodies Xi personally chairs; clicking a body opens its full meeting history. Three conventions keep it honest:
Primary sources are official: People's Daily / Xinhua, Qiushi, the CCDI and
People's Daily theory columns, CCTV, gov.cn, the State Commission Office for
Public Sector Reform, and Xi's published volumes. Each session, speech, and meeting links
out to its original source. The standard for any external claim is simple: cite a real,
checkable URL, or mark it unverified — never fabricate a quotation, a date, or an
attribution. Where a fact is widely reported but not yet pinned to a primary citation,
it is labelled as such.
The library of source documents is the single source of truth; all derived data and every page on this site are regenerated from it by open scripts. Per-item records — one row per allusion, per idiom occurrence, per session, per meeting — let any figure be checked against the underlying passage.