CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes) – the National Corpus of Contemporary Welsh
- Submitting institution
- Cardiff University / Prifysgol Caerdydd
- Unit of assessment
- 27 - English Language and Literature
- Output identifier
- 121781994
- Type
- T - Other
- DOI
-
-
- Location
- -
- Brief description of type
- CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes) – the National Corpus of Contemporary Welsh
- Open access status
- -
- Month
- -
- Year
- 2020
- URL
-
-
- Supplementary information
-
-
- Request cross-referral to
- -
- Output has been delayed by COVID-19
- No
- COVID-19 affected output statement
- -
- Forensic science
- No
- Criminology
- No
- Interdisciplinary
- No
- Number of additional authors
- 26
- Research group(s)
-
-
- Proposed double-weighted
- Yes
- Double-weighted statement
- The CorCenCC corpus is a substantial research output, containing over 11 million words. Gathering this material across spoken, written and electronic mediums while representing all genres, styles, registers and dialect regions entailed time-consuming engagement with a broad range of data sources and Welsh-speaking informants. Substantial time and effort were invested in designing new software to handle Welsh language tagging, a crowdsourcing data collection app, corpus query tools and a bespoke educational interface for Welsh learners and teachers. Directed by Knight, the project team comprised 37 academic investigators, consultants, RAs and advisors. For the PI’s extensive responsibilities, see the Resource Overview statement.
- Reserve for an output with double weighting
- No
- Additional information
- CorCenCC is the first corpus of contemporary spoken, written and electronically-mediated Welsh. ESRC/AHRC funded (2017-2020), CorCenCC was created in consultation with stakeholders including the Welsh Government, Welsh National Library, BBC and S4C.
• With over 11 million words of spoken, written and e-language, CorCenCC represents all major genres, (regional/social) varieties and contexts of contemporary Welsh use.
• Its user-driven model is transformative, providing a template and infrastructure for future corpus development in any language.
• Its unique crowdsourcing app, designed to complement more traditional methods of data collection, aligns corpus construction methods with the Web 2.0 age.
• Its integrated pedagogic toolkit provides data-driven exercises for teachers and learners of Welsh.
• Its new software tools, including Welsh-specific part-of-speech and semantic taggers, tagsets and corpus query tools, are available under an open licence and so can be adapted to any language.
This interdisciplinary and inter-institutional project was directed from Cardiff’s School of English, Communication and Philosophy by Knight as PI. Knight made significant research contributions to every facet of the project: developing the initial vision for the corpus, including its design, purpose and education-facing applications; conceptualising the user-driven approach; establishing the sampling frame and bespoke transcription conventions; co-designing the novel data management tools, the user interface, and the functionalities of the crowdsourcing app and querying tools; overseeing the development of the tagging tools; managing the project; curating outputs, reports and webpages; presenting at public and stakeholder engagement events; giving 46 academic presentations in 10 countries (including 14 plenaries/invited talks); and writing or contributing to 8 publications.
While the CorCenCC report (https://arxiv.org/abs/2010.05542) contextualises the research process of the project, the ‘Developing computational infrastructure for the CorCenCC corpus’ article is included in this output submission to provide a detailed technical description of the work.
Visit the project website to access the corpus (www.corcencc.org/download) and pedagogic tools (https://ytiwtiadur.corcencc.org).
- Author contribution statement
- -
- Non-English
- No
- English abstract
- -