CLEU - A Cross-Language English-Urdu Corpus and Benchmark for Text Reuse Experiments
- Submitting institution
- The University of Lancaster
- Unit of assessment
- 11 - Computer Science and Informatics
- Output identifier
- 250039108
- Type
- D - Journal article
- DOI
- 10.1002/asi.24074
- Title of journal
- Journal of the Association for Information Science and Technology
- Article number
- -
- First page
- 729
- Volume
- 70
- Issue
- 7
- ISSN
- 0002-8231
- Open access status
- Compliant
- Month of publication
- November
- Year of publication
- 2018
- URL
- -
- Supplementary information
- -
- Request cross-referral to
- -
- Output has been delayed by COVID-19
- No
- COVID-19 affected output statement
- -
- Forensic science
- No
- Criminology
- No
- Interdisciplinary
- No
- Number of additional authors
- 4
- Research group(s)
- B - Data Science
- Citation count
- 0
- Proposed double-weighted
- No
- Reserve for an output with double weighting
- No
- Additional information
- Plagiarism detection tools are widely used in academia and beyond, but very little research has focussed on the cross-lingual case. Urdu Natural Language Processing (NLP) research is still in its infancy, with very few resources available even though Urdu has more than 70 million native speakers. This paper combines both aspects to contribute a novel, freely available resource that is fostering research in Urdu NLP. We created, encoded and analysed this resource using sound principles of representativeness informed by corpus linguistics. We deliberately published it as open access in a top journal to support the visibility of the work.
- Author contribution statement
- -
- Non-English
- No
- English abstract
- -