Spoken corpus of Udmurt dialects

Publisher/Institution:
Universität Hamburg
Year of publication:
2025
Media type:
Dataset
Keywords:
  • Udmurt
  • dialectology
  • corpus
  • spoken corpus
  • transcripts
  • Udmurtia
  • Tatarstan
  • Russia
Description:
  • This deposit contains transcriptions of oral interviews and conversations in various dialects of Udmurt (Permic < Uralic; ISO 639-2 code udm). It comprises 25 transcribed recordings totaling 93.6 thousand words.

    Description of the contents

    The contents are as follows:

    • eaf (directory as ZIP archive): sound files and their transcripts in ELAN format
    • metadata_texts.csv: tab-delimited metadata for the transcriptions
    • metadata_speakers.csv: tab-delimited metadata for speakers
    • readme.txt: documentation

    Transcriptions

    All sound recordings are in WAV format, although some of them were originally recorded in a compressed format (see metadata). Transcriptions are stored in ELAN files. Each ELAN file is linked to one recording. The transcriptions were not thoroughly proofread and may contain mistakes. Please listen to the relevant segments to make sure their transcription is accurate. See readme.txt for further details.
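
    Because the transcriptions should be checked against the audio, it can help to list each segment together with its time offsets. ELAN files are XML documents, so the following is a minimal sketch that uses only the Python standard library. The function name read_eaf_segments is hypothetical, and the sketch makes no assumption about the tier names used in this corpus (see readme.txt); it simply collects every time-aligned annotation it finds.

```python
import xml.etree.ElementTree as ET


def read_eaf_segments(source):
    """Return (tier_id, start_ms, end_ms, text) for each time-aligned annotation.

    `source` is a path to an .eaf file or an open binary file object.
    """
    root = ET.parse(source).getroot()

    # TIME_ORDER maps time slot ids to offsets in milliseconds.
    # Slots without a TIME_VALUE are unaligned and are skipped here.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    segments = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            segments.append((tier_id, start, end, text))

    # Order by start time; segments with an unknown start go last.
    segments.sort(key=lambda seg: (seg[1] is None, seg[1] or 0))
    return segments
```

    The start and end offsets are in milliseconds, so a suspicious segment can be located in the corresponding WAV file and listened to before it is quoted or analysed.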

    Metadata

    The transcript-level metadata are:

    • filename (without the extension);
    • code of the collector (TA: Timofey Arkhangelskiy; NA: Nikolai Anisimov; YZ: Iuliia Zubova);
    • name of the place where the recording was made (in Russian);
    • original format of the recording (wav/wma/mp3);
    • genre;
    • date of the recording.

    The speaker-level metadata are:

    • code of the speaker;
    • speaker type: native vs. (non-native) linguist;
    • sex (F/M);
    • year of birth (when known);
    • variety of Udmurt they represent; usually this is the settlement where the speaker was born or spent their formative years.
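
    Although the two metadata files have a .csv extension, they are tab-delimited, so a reader configured for commas will not split the columns correctly. The sketch below shows one way to load them with the Python standard library. It assumes that each file starts with a header row and is encoded in UTF-8; neither is stated in this record, so please check readme.txt. The function names are hypothetical.

```python
import csv


def read_metadata(lines):
    """Parse tab-delimited text with a header row into a list of dicts.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumption: the first line holds the column names.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [dict(row) for row in reader]


def load_metadata(path, encoding="utf-8"):
    """Read a metadata file from disk (UTF-8 is an assumption)."""
    with open(path, newline="", encoding=encoding) as handle:
        return read_metadata(handle)
```

    For example, load_metadata("metadata_speakers.csv") would return one dictionary per speaker, keyed by whatever column names the file's header actually uses. Empty cells, such as an unknown year of birth, come back as empty strings.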

    The recordings were transcribed by Tatiana Anisimova and Nikolai Anisimov. Sound alignment was performed by Timofey Arkhangelskiy and Marina Pankova.

    References

    ELAN (Version 6.9) [Computer software]. (2024). Nijmegen: Max Planck Institute for Psycholinguistics, The Language Archive. Retrieved from https://archive.mpi.nl/tla/elan

    Contact

    If you have any questions or would like to propose a collaboration, please email Timofey Arkhangelskiy at timarkh@gmail.com.

  • The preparation of the corpus, as well as the collection of some of the data, was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project no. 428175960.
Relations:
DOI 10.25592/uhhfdm.17149
License:
  • info:eu-repo/semantics/restrictedAccess
Source system:
Forschungsdatenrepositorium der UHH

Internal metadata
Source record
oai:fdr.uni-hamburg.de:17150