We made 84,000 climate observations searchable using language models
At Findable, we specialise in structuring and understanding documentation related to buildings. This requires deep expertise in artificial intelligence, image processing, and handling unstructured data. Our mission involves helping clients transition from chaos to control. However, in this side project, we applied our skills to something entirely different: a unique, handwritten phenology dataset from the early 20th century.
Historical and Climate Relevance
Phenology is the study of seasonal natural phenomena, such as when snow melts, flowers bloom, or migratory birds return. From 1928 to 1952, Professor Henrik Printz led an effort to document these changes across Norway. The result was nearly 84,000 handwritten records of biological and climatic events, entered by hand into tables by teachers and volunteers throughout the country. Until now, this dataset was impossible to analyse digitally.
“We’ve greatly benefited from open-source tools in our own work,” says Findable co-founder and Head of Research, Lars Aurdal. “That’s why we wanted to give something back.”
Together with Eivind Kjosbakken, Data Scientist at Findable, Lars undertook this project to demonstrate how their document analysis expertise applies across domains.
“We chose to digitise and analyse this dataset so researchers can explore how climate change has affected seasonal patterns over the past century,” Lars explains.
Illegible Tables and Handwritten Numbers
The scanned pages were in double-page format, often skewed and always handwritten. The handwriting was small, cells were densely packed, and many characters were difficult to read, even for humans. Making these observations machine-readable required a meticulous approach involving image preprocessing and fine-tuning of large vision-language models (VLMs).
“We started with classical image preprocessing—splitting, rotating, and correcting the scans—and used morphological filters to identify individual cells,” says Lars.
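To illustrate the idea of using morphological filters to find table cells, here is a minimal sketch in plain Python. It is not the team's pipeline, and in practice an image library would do this work on real scans. It assumes a clean binary image (1 = ink, 0 = background), and every name in it (`open_rows`, `find_cells`, the kernel length `k`) is hypothetical. A morphological opening with a long, thin structuring element keeps only ink runs at least `k` pixels long, so ruling lines survive while short handwriting strokes disappear.

```python
from typing import List, Tuple

Grid = List[List[int]]  # 1 = ink, 0 = background


def open_rows(img: Grid, k: int) -> Grid:
    """Morphological opening with a 1 x k horizontal element.

    Keeps only horizontal ink runs that are at least k pixels long.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w - k + 1):
            if all(img[y][x:x + k]):      # the element fits here (erosion)
                for d in range(k):        # paint it back (dilation)
                    out[y][x + d] = 1
    return out


def transpose(img: Grid) -> Grid:
    return [list(col) for col in zip(*img)]


def runs(indices: List[int]) -> List[Tuple[int, int]]:
    """Merge consecutive indices into (start, end) runs, so thick lines count once."""
    out: List[Tuple[int, int]] = []
    for i in indices:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)
        else:
            out.append((i, i))
    return out


def find_cells(img: Grid, k: int) -> List[Tuple[int, int, int, int]]:
    """Return cell boxes as (top, left, bottom, right), bottom/right exclusive."""
    h_mask = open_rows(img, k)                        # horizontal ruling lines
    v_mask = transpose(open_rows(transpose(img), k))  # vertical ruling lines
    row_lines = runs([y for y, row in enumerate(h_mask) if any(row)])
    col_lines = runs([x for x, col in enumerate(transpose(v_mask)) if any(col)])
    cells = []
    for (_, top_end), (bot_start, _) in zip(row_lines, row_lines[1:]):
        for (_, left_end), (right_start, _) in zip(col_lines, col_lines[1:]):
            cells.append((top_end + 1, left_end + 1, bot_start, right_start))
    return cells


# Toy 9 x 13 "scan": a 2 x 2 table plus a few short handwriting strokes.
H, W = 9, 13
img = [[0] * W for _ in range(H)]
for y in (0, 4, 8):
    for x in range(W):
        img[y][x] = 1
for x in (0, 6, 12):
    for y in range(H):
        img[y][x] = 1
for y, x in [(2, 2), (2, 3), (3, 3), (5, 9), (6, 8), (6, 9)]:
    img[y][x] = 1

cells = find_cells(img, k=5)
```

Each box in `cells` can then be used to crop one table cell from the page. Real scans would additionally need the splitting, rotation, and correction steps Lars mentions before the lines are straight enough for this to work.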
Once each cell was isolated as an image, Eivind took over. He fed the cropped cells into Qwen 2.5 VL, a vision-language model that the team fine-tuned to interpret the handwriting.
“We used Unsloth, an efficient fine-tuning framework for LLMs, to train the model on this specific dataset. Only certain digits and letters were valid in specific columns. We taught the model that ‘1’ always had a diagonal stroke, and ‘7’ had a crossbar—details crucial for accurate interpretation,” he explains.
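The rule that only certain characters are valid in a given column can also serve as a simple sanity check on model output. The sketch below is an assumption about how such a check might look, not a description of the team's method: the column names and allowed character sets are invented for illustration, since the article does not list the actual columns.

```python
import string
from typing import Dict, List, Set, Tuple

# Hypothetical column alphabets; the real table layout is not given in the article.
ALLOWED: Dict[str, Set[str]] = {
    "day": set(string.digits),
    "month": set(string.digits),
    "code": set(string.digits) | set("ab"),
}


def invalid_chars(column: str, text: str) -> List[str]:
    """Return the characters in text that are not allowed in the given column."""
    allowed = ALLOWED[column]
    return [ch for ch in text if ch not in allowed]


# Made-up predictions as (column, transcribed text) pairs.
predictions: List[Tuple[str, str]] = [
    ("day", "17"),
    ("day", "l7"),    # lowercase L where a digit is expected
    ("code", "3a"),
    ("month", "1|"),  # table border read as a character
]

# Cells whose transcription contains a character outside the column's alphabet.
flagged = [(col, text) for col, text in predictions if invalid_chars(col, text)]
```

Flagged cells could be sent for manual review or re-transcribed, while cells that pass the check still need validation because a valid character can be the wrong one.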
Understanding the Model’s Failure Points
Before fine-tuning, the dataset was manually reviewed to understand common error sources.
“‘1’ and ‘7’ were frequently confused, but we also found issues like noisy scans, faint handwriting, and table borders being mistaken for characters. We used this knowledge to prepare the training data and help the model learn what to expect—and what not to,” Eivind says.
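A manual review like this can be supported by a tally of which characters are mistaken for which. The following is a minimal sketch with invented sample data. The function name `confusion_pairs` is hypothetical, and nothing here reflects the team's actual error counts.

```python
from collections import Counter
from typing import Iterable, Tuple


def confusion_pairs(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count (true, predicted) pairs where the prediction differs from the truth."""
    return Counter((truth, pred) for truth, pred in pairs if truth != pred)


# Made-up (true label, model prediction) pairs from a reviewed sample.
samples = [
    ("1", "7"),
    ("1", "1"),
    ("7", "1"),
    ("1", "7"),
    ("3", "3"),
    ("4", "9"),
]

counts = confusion_pairs(samples)
worst_pair, worst_count = counts.most_common(1)[0]
```

Sorting the counts shows where extra training examples or annotation guidance are most likely to pay off, for instance on the '1' versus '7' confusion Eivind describes.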
High-Precision Digitisation of Nearly 84,000 Observations
Through a combination of image analysis, annotation, fine-tuning, and validation, the team succeeded in making an almost unreadable dataset machine-readable with high precision.
“We’re making the dataset openly available, so anyone can explore how nature’s rhythms have changed over time,” says Lars.
The project illustrates how vision-language models can solve complex document challenges—the same kind addressed daily in the real estate and construction sectors.
For society, the project opens access to a previously untapped source of historical climate data.
“What can we learn from shifts in flowering dates or tree lines over time?” Lars asks. “The big picture is that we can now unlock information that has been gathering dust in basements for decades—and learn from it in ways that were never possible before.”