In this work, we explore the use of Vision-Language Models to describe a collection of early photographs of Japan. The collection consists of 18,155 Japanese photographs from various sources, including personal collections, and was described with 2,209 distinct keywords during a previous study. The descriptive tags cover a wide range of identifiers, including content terms (such as 'child', 'river', 'fan'), elements of Japanese culture (such as 'geisha', 'shamisen', 'torii'), types of photographs ('group portrait', 'landscape'), and specific Japanese locations or monuments ('Nikko', 'Yokohama', 'Toshogu Shrine'). We evaluate three models on this set and show that ChatGPT achieves high precision, unlike CLIP and SigLIP.
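For reference, the sketch below shows one way zero-shot keyword scoring with CLIP could be set up through Hugging Face `transformers`. The checkpoint name, image path, and tag subset are illustrative placeholders, not the exact configuration used in the study.

```python
# Minimal sketch of zero-shot tag scoring with CLIP (illustrative only).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical subset of the 2,209 keywords from the previous study.
tags = ["child", "river", "fan", "geisha", "torii", "landscape"]
image = Image.open("photo.jpg")  # placeholder path to one photograph

# Score the image against every candidate tag in a single forward pass.
inputs = processor(text=tags, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)[0]

# Print tags ranked by their score for this image.
for tag, p in sorted(zip(tags, probs.tolist()), key=lambda x: -x[1]):
    print(f"{tag}: {p:.3f}")
```

A SigLIP model would be queried the same way, except that its sigmoid-based training makes per-tag `logits_per_image.sigmoid()` scores the natural choice, which suits multi-label tagging where several keywords can apply to one photograph.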
Check out our slides presented at DH24 here.