Chinese Zero to Hero
Settings are saved automatically as soon as you change them.

Settings


Corpus Settings

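A corpus is a large collection of language texts from which we can extract collocations and example sentences. Our text corpora are provided by Sketch Engine, and there are many different English corpora to choose from. The example sentences and collocations you see will differ depending on the corpus you choose.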
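To make the idea concrete, here is a minimal sketch of how collocations and example sentences can be pulled from a corpus, and why the results depend on which corpus is selected. It is only an illustration, not how Sketch Engine works internally: the two toy corpora (`WEB_SAMPLE`, `NEWS_SAMPLE`) and the helper functions are hypothetical, and "collocation" is simplified here to "the word immediately to the left".

```python
from collections import Counter

# Hypothetical toy corpora (assumptions for illustration only);
# the real corpora listed below hold millions to billions of words.
WEB_SAMPLE = [
    "She made strong tea for everyone.",
    "He drinks strong tea every morning.",
    "They drink green tea after dinner.",
]
NEWS_SAMPLE = [
    "Exports of black tea rose last year.",
    "Black tea prices fell in March.",
]


def tokenize(sentence):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    words = (w.strip(".,!?;:").lower() for w in sentence.split())
    return [w for w in words if w]


def collocations(corpus, word, window=1):
    """Count the words found within `window` tokens to the left of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok == word:
                counts.update(tokens[max(0, i - window):i])
    return counts


def example_sentences(corpus, word):
    """Return every sentence in the corpus that contains `word`."""
    return [s for s in corpus if word in tokenize(s)]


if __name__ == "__main__":
    for name, corpus in [("web", WEB_SAMPLE), ("news", NEWS_SAMPLE)]:
        print(name, collocations(corpus, "tea").most_common(2))
        print(name, example_sentences(corpus, "tea")[:1])
```

With these toy inputs, "strong" is the most frequent left-hand neighbour of "tea" in the first sample and "black" in the second, which mirrors how switching corpora changes the collocations and examples shown.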

Corpus Code Language Word count Notes
Timestamped JSI web corpus 2014-2019 English eng_jsi_newsfeed_virt English 40,669,324,757 English corpus of news articles obtained by crawling a list of RSS feeds. Corpus tagged by TreeTagger v2.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
English Web 2013 (enTenTen13) ententen13_tt2_1 English 19,685,733,337 English web corpus. Downloaded by SpiderLing in Dec 2013. Cleaned, deduplicated, tagged by TreeTagger pipeline v2 using modified Penn TreeBank tagset.
  • Parallel. That means L1 translation is available.
  • Web.
Timestamped JSI web corpus 2014-2016 English eng_jsi_newsfeed_1 English 18,315,071,361 English corpus of news articles obtained by crawling a list of RSS feeds. Corpus tagged by TreeTagger v2.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
English Web 2015 (enTenTen15) ententen15_tt21 English 15,703,895,409 English web corpus. Downloaded by SpiderLing in Nov & Dec 2015. Cleaned, deduplicated, tagged by TreeTagger.
  • Featured.
  • Parallel. That means L1 translation is available.
  • Web.
English Web 2012 (enTenTen12) ententen12_1 English 11,191,860,036 English web corpus. Crawled by SpiderLing in May 2012. Encoded in UTF-8, cleaned, deduplicated, tagged by TreeTagger trained on English Penn Treebank v2.5.
  • Parallel. That means L1 translation is available.
  • Web.
English Web 2008 (enTenTen08) ententen_1 English 2,759,340,513 English web corpus crawled by Heritrix in May 2008. Encoded in UTF-8. Tagged by TreeTagger v2.5.
  • Parallel. That means L1 translation is available.
  • Web.
Open Access Journals (DOAJ - English) doaj_en English 2,662,763,697 English journals from Directory of Open Access Journals database.
  • Parallel. That means L1 translation is available.
OEC v2 oec_biwec3_2 English 2,073,563,928 Oxford English Corpus (OEC + Biwec build v2, Feb 2012, v2)
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
OEC oec_biwec3 English 2,073,319,589 Oxford English Corpus (OEC + Biwec build v2, Feb 2012)
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
LEXMCI lexmci English 1,448,180,339 Corpus of English created by the Lexicography MasterClass in 2008 as a source of lexicographic information for the lexicographers compiling the Dante database. Tagged by TreeTagger for English.
  • Parallel. That means L1 translation is available.
English Wikipedia enwiki English 1,356,523,079 English Wikipedia Corpus built from Wikipedia dump (from second half of September 2014) using WikiExtractor.py script and a part of Brno pipeline. Encoded in UTF-8, cleaned, deduplicated. Tagged by TreeTagger v2.5.
  • Parallel. That means L1 translation is available.
British Web 2007 (ukWaC) ukwac_tt2_1 English 1,313,058,436 English web corpus from the .uk Internet domain crawled by Heritrix in 2007. Encoded in UTF-8, cleaned and deduplicated. Fixed quotation marks, retokenised and retagged with TreeTagger in 2018.
  • Parallel. That means L1 translation is available.
  • Web.
OPUS2 English opus2_en English 1,139,515,048 English corpus of OPUS2 (open source parallel corpus). Encoded in UTF-8, tagged with the Penn Treebank tagset. The OPUS2 collection contains 40 languages.
  • Parallel. That means L1 translation is available.
English Corpus for SkELL 3.8 skell_3_8 English 1,041,772,774 English web corpus for SkELL [ver 3.8 # Jul 2017] FFFD removed
  • Parallel. That means L1 translation is available.
  • Web.
English Corpus for SkELL 3.9 skell_3_9 English 1,041,138,575 English web corpus for SkELL [ver 3.9 # Dec 2017] FFFD removed, further filtered
  • Parallel. That means L1 translation is available.
  • Web.
English Corpus for SkELL 3.10 skell_3_10 English 1,038,200,313 English web corpus for SkELL [ver 3.10 # May 2019] removed with contracted n't
  • Parallel. That means L1 translation is available.
  • Web.
Araneum Anglicum Maius [2015] en_araneum_maius_1 English 888,466,066 English Web (crawled in November 2013, version 1.3.10) 1,20 G (build #a048)
  • Parallel. That means L1 translation is available.
  • Web.
Araneum Anglicum Asiaticum Maius [2015] en_as_araneum_maius English 867,259,037 Asian English Web (crawled in September and October 2014, version 1.3.10) 1,20 G (build #a046)
  • Parallel. That means L1 translation is available.
  • Web.
Araneum Anglicum Africanum Maius [2015] en_af_araneum_maius English 854,484,093 African English Web (crawled in February 2015, version 1.3.00) 1,20 G (build #a047)
  • Parallel. That means L1 translation is available.
  • Web.
English Historical Book Collection (EEBO, ECCO, Evans) early_english English 826,296,048 Corpus collection of English books published between 1473 and 1820. Texts are from EEBO Phase I, ECCO and Readex's Evans projects. Tagged by TreeTagger using the Penn TreeBank tagset.
  • Featured.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
Timestamped JSI web corpus 2019-07 English eng_jsi_newsfeed_lastmonth English 787,246,410 English corpus of news articles obtained by crawling a list of RSS feeds. Corpus tagged by TreeTagger v2.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
English Broadsheet Newspapers 1993–2013 (SiBol with trends) sibolport13_tt21 English 654,435,535 The SiBol (Siena-Bologna) corpus of English broadsheet newspapers from 1993, 1995, 2010, 2013. Tagged by TreeTagger pipeline v. 2.1, computed trends.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
EUR-Lex English 2/2016 eurlex_eng English 629,722,593 EUR-Lex multilingual corpus of all the official languages of the European Union (currently 24 languages). Tagged by TreeTagger pipeline v2.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
Project Gutenberg English gutenberg_en English 443,471,071 Corpus from Project Gutenberg html ebooks
  • Parallel. That means L1 translation is available.
UKWaC super sensed ukwacsst English 315,402,632
  • Parallel. That means L1 translation is available.
  • Web.
Oxford Children's Corpus 2016 occ16 English 284,360,063
  • Parallel. That means L1 translation is available.
Timestamped JSI web corpus 2019-08 English eng_jsi_newsfeed_curmonth English 277,241,554 English corpus of news articles obtained by crawling a list of RSS feeds. Corpus tagged by TreeTagger v2.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
  • Web.
Oxford Children's Corpus 2016 -- Writing occ16w English 229,177,934
  • Parallel. That means L1 translation is available.
New corpus for English (NCI English) nci English 217,548,758 Tagged with TreeTagger.
  • Parallel. That means L1 translation is available.
Oxford Children's Corpus 2015 occ15 English 210,322,185
  • Parallel. That means L1 translation is available.
English Web 2013 sample ententen13_tt2_1_term_ref English 204,976,089 English web corpus for terminology. Downloaded by SpiderLing in Dec 2013. Cleaned, deduplicated, tagged by TreeTagger pipeline v2. With definitions.
  • Parallel. That means L1 translation is available.
Oxford Children's Corpus 2015 -- Writing occ15w English 174,714,324
  • Parallel. That means L1 translation is available.
Oxford Children's Corpus 2014 occ14 English 159,324,873
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
Brexit corpus (English) brexit_1 English 108,452,923 Corpus about Brexit from twitter and web
  • Parallel. That means L1 translation is available.
ScienceBlogs ScienceBlog English 103,175,233 English ScienceBlogs corpus prepared by Akshay Minocha in 2014. A selection of posts and comments from scienceblogs.com from 2006 to the beginning of 2014. The corpus is tagged using TreeTagger with the Penn tagset v2.5.
  • Parallel. That means L1 translation is available.
  • Web.
British National Corpus (BNC) bnc2_tt21 English 96,134,547 Balanced English corpus of written and spoken language. Processed by TreeTagger pipeline v2.1
  • Featured.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
British National Corpus (BNC), tagged by CLAWS bnc2 English 96,052,598 Balanced English corpus of written and spoken language. Encoded in UTF-8. Tagged by CLAWS v5 tagger.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
New Model Corpus model_corpus_1 English 95,276,958 New Model Corpus, a balanced 100m-word corpus of general English built from the Web in 2008. Encoded in UTF-8, cleaned and deduplicated. Tagged with TreeTagger using the English Penn Treebank tagset v2.5.
  • Parallel. That means L1 translation is available.
  • Web.
Boot Camp English ententen15_tt21_bootcamp_sample English 85,683,246 English web corpus. Downloaded by SpiderLing in Nov & Dec 2015. Cleaned, deduplicated, tagged by TreeTagger. For boot camp purposes.
  • Featured.
  • Parallel. That means L1 translation is available.
Corpus of Academic Journal Articles (CAJA) CAJA English 79,107,410
  • Parallel. That means L1 translation is available.
Oxford Corpus of Academic English (April 2012) ocae_test English 71,372,972 https://www.sketchengine.co.uk/oxford-corpus-of-academic-english/
  • Parallel. That means L1 translation is available.
ACL Anthology Reference Corpus (ARC) aclarc_2 English 62,196,334 Anthology Reference Corpus is a digital archive of 18,288 research papers in computational linguistics sponsored by the Association for Computational Linguistics (ACL). This release contains most of the papers that appear up to 2015. Tagged with Penn Treebank tagset.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
DGT, English dgt__english English 59,106,576 English corpus of DGT-Translation Memory, which consists of 24 languages. Tagged with English Penn Treebank tagset v2.5.
  • Parallel. That means L1 translation is available.
Oxford Children's Corpus 2016 -- Reading occ16r English 53,858,955
  • Parallel. That means L1 translation is available.
EUROPARL7, English europarl7_en English 53,837,625 Parallel corpus of European Parliament proceedings. Encoded in UTF-8. Expanded in 2010 and 2012. Tagged by TreeTagger in 2015.
  • Parallel. That means L1 translation is available.
  • Spoken.
EUR-Lex judgments English 12/2016 judgments_eurlex_eng English 42,339,337 EUR-Lex judgments corpus is a multilingual corpus in all the official languages of the European Union focused only on judgments of the Court of Justice and thus a subset of the whole EUR-Lex corpus.
  • Parallel. That means L1 translation is available.
  • Diachronic. That means time information is available.
pukWaC (ukWaC parsed with MaltParser) pukwac English 39,502,648 The part of ukWaC (British web corpus) created by Siva Reddy and parsed using Malt Parser.
  • Parallel. That means L1 translation is available.
  • Web.
Oxford Children's Corpus 2015 -- Reading occ15r English 34,284,687
  • Parallel. That means L1 translation is available.
Medical Web Corpus web_med_1 English 33,961,786 web_med
  • Parallel. That means L1 translation is available.
  • Web.
PICAE 2010 PICAE2010 English 31,025,920
  • Parallel. That means L1 translation is available.
EcoLexicon English (Environment) ecolexicon_en English 23,169,446
  • Parallel. That means L1 translation is available.
CHILDES English Corpus childes_en2_2 English 22,693,506 The English corpus is one of the CHILDES corpora of child language. Encoded in UTF-8, not tagged yet. The whole database contains 24 languages.
  • Parallel. That means L1 translation is available.
  • Spoken.
Open American National Corpus (written) oanc_text English 11,048,137 The Open American National corpus merged with The Manually Annotated Sub-Corpus. September 2016. Encoded in UTF-8, deduplicated, enriched by metadata. Tagged with TreeTagger pipeline v2.
  • Parallel. That means L1 translation is available.
British Law Report Corpus blarc English 8,515,749 British Law Report Corpus of texts from UK courts and tribunals from 2008 to 2010. Encoded in UTF-8. Tagged by TreeTagger.
  • Parallel. That means L1 translation is available.
Brown Family, CLAWS + TreeTagger tags brown_family_1 English 6,975,474 English corpus from Brown family including other corpora. Encoded in UTF-8. Tagged by CLAWS + TreeTagger.
  • Parallel. That means L1 translation is available.
British Academic Written English Corpus (BAWE) bawe2 English 6,968,089 British Academic Written English corpus of good-standard student assignments. Prepared for Sketch Engine by Paul Thompson and Alois Heuboeck at Reading. Tagged by Paul Rayson with CLAWS v7 POS tags and WMatrix semantic categories.
  • Parallel. That means L1 translation is available.
Brown Family brown_family_tt2 English 6,963,778 English corpus from Brown family including other corpora. Encoded in UTF-8. Tagged by TreeTagger v2.
  • Parallel. That means L1 translation is available.
e-flux (International art English) eflux_1 English 5,036,119 Web corpus of English art news digests containing 9538 art announcements released from March 1998 to May 2012 collected from e-flux. Tagged with Penn Treebank tagset v2.5.
  • Parallel. That means L1 translation is available.
  • Web.
Brexit corpus without retweets (English) brexit_dedup English 4,789,571 Corpus about Brexit from twitter and web without retweets.
  • Parallel. That means L1 translation is available.
Penn Corpora of Historical English pennHistEn English 3,800,639
  • Parallel. That means L1 translation is available.
Open American National Corpus (spoken) oanc_speech English 3,202,026 The Open American National corpus merged with The Manually Annotated Sub-Corpus. September 2016. Encoded in UTF-8, deduplicated, enriched by metadata. Tagged with TreeTagger pipeline v2.
  • Parallel. That means L1 translation is available.
  • Spoken.
Cambridge Academic English cupcamcae6 English 3,163,648
  • Parallel. That means L1 translation is available.
Ted Talks transcripts TED_en English 2,882,085 English corpus of transcripts of TED talks, prepared by Akshay Minocha. Tagged by TreeTagger with Penn Treebank tagset v2.5.
  • Parallel. That means L1 translation is available.
  • Spoken.
Multicultural London English Corpus lecorpus_3 English 2,391,040 Corpus of multicultural London English published by Jenny Cheshire et al in 2011. It consists of transcripts of informal conversation-like interviews with 1 or 2 speakers. Tagged with Penn Treebank tagset v2.5.
  • Parallel. That means L1 translation is available.
  • Spoken.
English Preposition Corpus english_preposition_corpus English 2,136,325
  • Parallel. That means L1 translation is available.
Oxford Children's Corpus 2015 -- Education occ15e English 1,323,174
  • Parallel. That means L1 translation is available.
British Academic Spoken English Corpus (BASE) base English 1,186,290 British Academic Spoken English corpus of lecture and seminar transcripts at two UK universities in 1998–2005. Encoded in UTF-8. Tagged by CLAWS v7.
  • Parallel. That means L1 translation is available.
  • Spoken.
Corpus of English Dialogues 1560–1760 english_dialogues English 1,151,171 Corpus of Early Modern English speech-related texts. It was compiled by Merja Kytö and Jonathan Culpeper, in collaboration with Terry Walker and Dawn Archer, at Uppsala and Lancaster Universities. Tagged with Penn Treebank tagset.
  • Parallel. That means L1 translation is available.
Brown brown_1 English 1,007,299 Brown University Standard Corpus of Present-Day American English. Encoded in UTF-8. Version from 1979. Tagged by TreeTagger v2.5.
  • Parallel. That means L1 translation is available.
Semcor v3.0 (sense-tagged corpus) semcor3_0 English 664,038 Sense-tagged English corpus created from texts of the Brown corpus. Semantic analysis was performed manually with WordNet 1.6 and later automatically mapped to WordNet 3.0 by Rada Mihalcea. The corpus contains marked multi-word expressions (MWE) prepared by Siva Reddy.
  • Parallel. That means L1 translation is available.
Opus MontenegrinSubs: English opusmonte_en English 468,337 Opus Montenegrin subtitles: English part - more info in the paper Opus-MontenegrinSubs 1.0: First electronic corpus of the Montenegrin language.
  • Parallel. That means L1 translation is available.
Susanne susanne English 128,998 Susanne corpus, an approximately 130,000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE annotation scheme (which attempts to provide a method of representing all aspects of English grammar that are sufficiently definite to be susceptible of formal annotation). Tagged with TreeTagger v2.2.
  • Parallel. That means L1 translation is available.
  • Web.