Michael Burkhardt's
TPS Frequency Dictionary of Mandarin Chinese
INTRODUCTION
The dictionary and word list each contain a total of 26,704 entries of between one and four
characters in length. Each entry appears in one of 2,500 character
sections, arranged according to character frequency. Within each
character section, words are listed in frequency order and assigned to
numbered groups from 1 through 5, meant to give a general sense of
how common the words are: group 1 words are the most common,
while group 5 words are relatively rare. The character section in which a
particular entry appears--all entries appear only once--has been
determined by the least common component character. For example, the
common compound 以后 is not listed with character #18 (以) but rather
with #59 (后). This three-way ordering and filtering technique is what I call
the "Triple Progression System."
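The placement rule can be sketched in a few lines of Python. This is only an illustration, not the script used to build the list: the function name section_for and the tiny CHAR_RANK table are assumptions, and only the two ranks quoted above (以 = #18, 后 = #59) come from the actual data.

```python
# Hypothetical rank table: character -> frequency rank (1 = most common).
# Only these two ranks are quoted in the text; a full table would hold 2,500.
CHAR_RANK = {"以": 18, "后": 59}


def section_for(word, char_rank=CHAR_RANK):
    """Return the character whose section should list `word`.

    A word is filed under its least common component character, i.e. the
    character with the largest rank number. Returns None when the word is
    empty or contains a character missing from the rank table.
    """
    if not word or any(ch not in char_rank for ch in word):
        return None
    return max(word, key=lambda ch: char_rank[ch])


section = section_for("以后")
print(section, CHAR_RANK[section])
```

Because every word is filed under exactly one character, no entry can appear twice, and a learner working through the sections in order will already have met every other character in the word.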
The character and word data used to build the word list are derived from
Jun Da's Chinese Text Computing website, and in particular the character
and word frequency distribution data for a news- and information-based
subcorpus of modern Mandarin Chinese. In order to eliminate nonsense
character combinations and other frequently occurring non-words, the
CC-CEDICT Chinese-English dictionary was used to validate the data. Words
and phrases not appearing in CC-CEDICT were excluded. (However, some
high-frequency non-word collocations that also happen to be low-frequency
words, such as 比亚, do appear in the list.)
In all, 152,688 bigrams, trigrams, and quadrigrams were processed using
a Perl script that compared them with a list of the top 2,500 characters
and the CC-CEDICT dictionary. The resulting list contains a total of
26,704 entries (2,500 single characters, 20,294 bigrams, 999 trigrams,
and 2,911 quadrigrams).
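The filtering step might look roughly like the following. This is a Python sketch rather than the original Perl script, and it rests on assumptions: the function name filter_ngrams is hypothetical, the sample frequencies are made up for illustration, and the rule that every character of an entry must be among the top 2,500 is my reading of "compared them with a list of the top 2,500 characters."

```python
def filter_ngrams(ngrams, top_chars, dictionary_words):
    """Keep n-grams that pass both checks described in the text.

    ngrams:           iterable of (word, frequency) pairs
    top_chars:        set of allowed characters (the top 2,500 in the text)
    dictionary_words: set of known headwords (CC-CEDICT in the text)
    """
    kept = []
    for word, freq in ngrams:
        # Only bigrams, trigrams, and quadrigrams are considered here.
        if not 2 <= len(word) <= 4:
            continue
        # Assumed rule: every character must be in the top-character list,
        # and the whole string must be a dictionary headword.
        if all(ch in top_chars for ch in word) and word in dictionary_words:
            kept.append((word, freq))
    return kept


# Toy data: the frequencies below are invented, not taken from the corpus.
top_chars = set("以后比亚")
dictionary_words = {"以后", "比亚"}
candidates = [("以后", 1200), ("后以", 40), ("比亚", 7), ("以后的", 300)]

print(filter_ngrams(candidates, top_chars, dictionary_words))
```

As a check on the totals quoted above, 2,500 + 20,294 + 999 + 2,911 = 26,704.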
There are 42 character sections that do not include any word entries. Of
these 42 characters, 14 appear in entries elsewhere in the dictionary.
The other 28 characters do not appear anywhere in the dictionary.
The words accompanying each character section have been grouped
into levels according to word frequency. Jun Da suggests three
frequency (X) ranges (or "stages") for the acquisition of new
vocabulary by foreign learners of Chinese: very low (X≤5), medium-low
to low (5<X≤50), and medium to high (X>50). For this project, the
lowest frequency range has been eliminated, and the top range has been
split up, resulting in a total of five ranges, or groups, as follows:
(1)  1,000 ≤ X             1,495 words (most common)
(2)    250 ≤ X < 1,000     2,979 words
(3)    100 ≤ X < 250       3,251 words
(4)     50 ≤ X < 100       3,037 words
(5)      5 ≤ X < 50       13,442 words (least common)
This arrangement results in the smallest number of words at level one
(the most common words), relatively uniform distribution over levels two
through four, and a sizable group of the least common words at level five.
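The five ranges translate directly into a lookup. The Python sketch below uses the thresholds listed above; the names GROUP_FLOORS and frequency_group are assumptions chosen for illustration, and X is whatever frequency value the source data supplies.

```python
# (group number, inclusive lower bound), ordered from most to least common.
GROUP_FLOORS = [(1, 1000), (2, 250), (3, 100), (4, 50), (5, 5)]


def frequency_group(x):
    """Return the group (1-5) for frequency x, or None if x is below 5."""
    for group, floor in GROUP_FLOORS:
        if x >= floor:
            return group
    return None  # below the lowest range, so excluded from the list


for x in (1200, 999, 250, 100, 50, 5, 4):
    print(x, frequency_group(x))
```

The five group sizes add up to 24,204 words, which equals the 26,704 total entries minus the 2,500 single characters, suggesting that the groups cover the multi-character entries only.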
FREE DOWNLOAD (Updated May 6, 2010)
TPS Word List Only (No Pinyin or English)
(All files are UTF-8 encoded text.)
BUY THE BOOK
The print edition of the TPS Frequency Dictionary of Mandarin Chinese combines the
TPS frequency dictionary and TPS grouped word list into a single bound volume,
along with a Pinyin Index and list of modern radicals.
UPDATES
(May 6, 2010) Complete Frequency Dictionary version now available!
(Apr 27, 2010) I'm working on an extended version that includes Pinyin pronunciations and English translations. I should have it available soon.
(Apr 26, 2010) A Pleco-friendly version of the list is available. (See above.)