Corpws Cenedlaethol Cymraeg Cyfoes (CorCenCC) corpus and query tools
- Submitting institution
- Swansea University / Prifysgol Abertawe
- Unit of assessment
- 26 - Modern Languages and Linguistics
- Output identifier
- 55093
- Type
- S - Research data sets and databases
- DOI
- 10.17035/d.2020.0119878310
- Location
- https://www.corcencc.org/
- Month
- October
- Year
- 2020
- URL
- https://www.corcencc.org/
- Supplementary information
- -
- Request cross-referral to
- -
- Output has been delayed by COVID-19
- No
- COVID-19 affected output statement
- -
- Forensic science
- No
- Criminology
- No
- Interdisciplinary
- Yes
- Number of additional authors
- 26
- Research group(s)
- -
- Proposed double-weighted
- Yes
- Double-weighted statement
- This submission is the main output from a major four-year AHRC/ESRC-funded project (ES/M011348/1). It constitutes a data set of 14,338,149 tokens (circa 11.2 million words), collected according to a principled sampling frame and submitted to processes of anonymisation, transcription, semantic tagging (using the bespoke tool SemCyTag) and part-of-speech (POS) tagging (using the bespoke tool CyTag). In addition to the corpus (the first of its kind for the Welsh language), the output includes supporting documentation and information, including a project report of approximately 19,000 words. All elements of the output are presented bilingually (in English and Welsh).
- Reserve for an output with double weighting
- No
- Additional information
- -
- Author contribution statement
- -
- Non-English
- No
- English abstract
- -