Historical written artefacts are multi-dimensional objects with several modalities that are typically analysed separately by dedicated computational systems. These modalities arise as research data from the study of the artefacts and include digital images, measurements of material properties, and metadata on historical contexts. In most cases, these modalities are interrelated and interdependent. Understanding the relationships between the different modalities, and learning to associate them, can therefore be essential for a holistic understanding that goes beyond the textual contents of historical written artefacts. Recent advances in research on multimodal models offer the possibility of analysing the different modalities of historical artefacts and modelling the relationships between them. Such models can support scholars in tasks such as text-based image retrieval and visual question answering. This work aims to explore the potential of multimodal models by expressing the different modalities of research data on historical written artefacts in image and text formats, so that vision-language models can be employed.
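To make the intended use concrete, the following is a minimal sketch of text-based image retrieval with an off-the-shelf vision-language model (here CLIP via the Hugging Face transformers library); the model checkpoint, the example query, and the image file names are illustrative assumptions and not part of the work described above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed off-the-shelf checkpoint; any compatible CLIP model could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical digital images of written artefacts (the visual modality).
image_paths = ["manuscript_page_001.jpg", "manuscript_page_002.jpg"]
images = [Image.open(p) for p in image_paths]

# A textual modality, e.g. a catalogue description, used as the retrieval query.
query = "a parchment page with two columns of Latin text"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity scores between the query and each image, ranked highest first.
scores = outputs.logits_per_text.squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(rank, image_paths[idx], float(scores[idx]))
```

In this sketch both modalities are expressed in the formats the vision-language model expects (an image and a short text), which is the general strategy pursued in this work; other modalities, such as material measurements or historical metadata, would first have to be rendered as images or verbalised as text before they can be handled in the same way.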