Spoken corpus of Udmurt dialects

Link:

https://doi.org/10.25592/uhhfdm.17150

Author:

Publisher/Institution:

Universität Hamburg

Year of publication:

2025

Media type:

Dataset

Keywords:

Udmurt
dialectology
corpus
spoken corpus
transcripts
Udmurtia
Tatarstan
Russia

Description:

This deposit contains transcriptions of oral interviews and conversations in various dialects of Udmurt (Permic < Uralic; ISO 639-2 code udm). It comprises 25 recordings with transcripts, totaling about 93,600 words.

Description of the contents

The contents are as follows:
- eaf (directory as ZIP archive): sound files and their transcripts in ELAN
- metadata_texts.csv: tab-delimited metadata for the transcriptions
- metadata_speakers.csv: tab-delimited metadata for speakers
- readme.txt: documentation

Transcriptions

All sound recordings are in WAV format, although some of them were originally recorded in a compressed format (see the metadata). Transcriptions are stored in ELAN files. Each ELAN file is linked to one recording. The transcriptions were not thoroughly proofread and may contain mistakes. Please listen to the relevant segments to make sure their transcription is accurate. See readme.txt for further details.
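
Because the transcriptions should be checked against the audio, it can help to list each transcribed segment with its time offsets. The following is a minimal sketch, assuming the files follow the standard ELAN EAF XML layout (TIME_SLOT elements with millisecond values, TIER elements containing ALIGNABLE_ANNOTATION elements). The function names are illustrative, and the tier names used in this corpus are not stated here, so the sketch simply reports whatever TIER_ID values it finds; see readme.txt for the actual tier structure.

```python
import xml.etree.ElementTree as ET


def aligned_segments(root):
    """Return (tier_id, start_ms, end_ms, text) for each time-aligned annotation."""
    # Map time slot IDs to millisecond offsets; unaligned slots lack TIME_VALUE.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="").strip()
            segments.append((tier_id, start, end, text))
    # Sort by start time; segments without a known start go last.
    segments.sort(key=lambda s: (s[1] is None, s[1] or 0))
    return segments


def read_eaf(path):
    """Parse one .eaf file from disk and return its aligned segments."""
    return aligned_segments(ET.parse(path).getroot())
```

Calling `read_eaf` on one of the unpacked ELAN files yields start and end offsets in milliseconds, which can be used to locate the corresponding stretch in the linked WAV file. Annotations that depend on other annotations (REF_ANNOTATION elements) are ignored in this sketch.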

Metadata

The transcript-level metadata are:
- filename (without the extension);
- code of the collector (TA: Timofey Arkhangelskiy; NA: Nikolai Anisimov; YZ: Iuliia Zubova);
- name of the place where recording was made (in Russian);
- original format of the recording (wav/wma/mp3);
- genre;
- date of the recording.

The speaker-level metadata are:
- code of the speaker;
- speaker type: native vs. (non-native) linguist;
- sex (F/M);
- year of birth (when known);
- variety of Udmurt they represent; usually this is the settlement where the speaker was born or spent their formative years.

The recordings were transcribed by Tatiana Anisimova and Nikolai Anisimov. Sound alignment was performed by Timofey Arkhangelskiy and Marina Pankova.
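
The two metadata tables described above are tab-delimited despite the .csv extension, so a reader that defaults to commas will not split the fields. The following is a minimal sketch of loading them with Python's standard csv module. It assumes that the first row of each file holds column names and that the files are UTF-8 encoded; neither is stated in this description, so check readme.txt and the files themselves. The function names are illustrative.

```python
import csv


def parse_rows(lines):
    """Parse tab-delimited lines into dicts keyed by the header row."""
    # Assumption: the first line contains the column names.
    return list(csv.DictReader(lines, delimiter="\t"))


def load_metadata(path, encoding="utf-8-sig"):
    """Load a tab-delimited metadata file into a list of dicts."""
    # Assumption: UTF-8 text; "utf-8-sig" also tolerates a leading BOM.
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as handle:
        return parse_rows(handle)
```

For example, `load_metadata("metadata_texts.csv")` and `load_metadata("metadata_speakers.csv")` would each return one dictionary per row, with keys taken from whatever header names the files actually use.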

References

ELAN (Version 6.9) [Computer software]. (2024). Nijmegen: Max Planck Institute for Psycholinguistics, The Language Archive. Retrieved from https://archive.mpi.nl/tla/elan

Contact

If you have any questions or would like to propose a collaboration, please email Timofey Arkhangelskiy at timarkh@gmail.com.

The preparation of the corpus, as well as the collection of some of the data, was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project no. 428175960.

Relations:

DOI 10.25592/uhhfdm.17149

License:

info:eu-repo/semantics/restrictedAccess

Source system:

Forschungsdatenrepositorium der UHH

Internal metadata

Source record: oai:fdr.uni-hamburg.de:17150