We made 84,000 climate observations searchable using language models

Written by Eivind Kjosbakken | Jun 5, 2025 12:12:25 PM

At Findable, we specialize in structuring and understanding documentation related to buildings. This requires deep expertise in artificial intelligence, image processing, and the handling of unstructured data. Our mission is to help clients move from chaos to control. But in this side project, we applied our skills to something completely different: a unique, handwritten phenology dataset from the early 20th century.

Historical and climate relevance

Phenology is the study of seasonal natural phenomena – when the snow melts, flowers bloom, or migratory birds return. From 1928 to 1952, Professor Henrik Printz led a nationwide effort to document these changes across Norway. The result was nearly 84,000 handwritten records of biological and climatic events, manually entered into tables by teachers and volunteers across the country. Until now, this dataset was impossible to analyze digitally.

“We’ve greatly benefited from open-source tools in our own work,” says Findable co-founder and Head of Research, Lars Aurdal. “That’s why we wanted to give something back.”

Together with Eivind Kjosbakken, Data Scientist at Findable, Lars took on this project to demonstrate how their document analysis expertise can be applied across domains.

“We chose to digitize and analyze this dataset so that researchers and others can now explore how climate change has affected seasonal patterns over the past century,” Lars explains.

Illegible tables and handwritten numbers

The scanned pages were in double-page format – often skewed and always handwritten. The handwriting was small, the cells were densely packed, and many characters were difficult to read, even for humans. Making these observations machine-readable required a meticulous approach, from image preprocessing to the fine-tuning of large vision-language models (VLMs).

[Image: a table from the dataset, scanned as a double-page spread, often skewed and always handwritten.]

“We started with classical image preprocessing – splitting, rotating, and correcting the scans – and used morphological filters to identify individual cells,” says Lars.
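The article does not show the team's actual pipeline, so the following is only a minimal sketch of the idea, assuming a page that has already been deskewed and binarized (ink = 1, paper = 0). It splits a spread at the midpoint and uses a morphological opening with a long, thin line to keep table rules while discarding handwriting; the gaps between rules then give the cell boxes. All function names are hypothetical, and on real scans a library such as OpenCV would normally do this work.

```python
def split_spread(img):
    """Split a double-page scan into left and right pages at the midpoint.
    Assumption: the gutter sits in the center of the scan."""
    mid = len(img[0]) // 2
    return [row[:mid] for row in img], [row[mid:] for row in img]


def open_rows(img, k):
    """Morphological opening with a 1 x k horizontal line: keeps only
    horizontal ink runs that are at least k pixels long."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        start = None
        for x, v in enumerate(list(row) + [0]):  # sentinel closes a trailing run
            if v and start is None:
                start = x
            elif not v and start is not None:
                if x - start >= k:
                    for i in range(start, x):
                        out[y][i] = 1
                start = None
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def band_centers(flags):
    """Collapse each run of consecutive True flags into one center index."""
    centers, start = [], None
    for i, f in enumerate(list(flags) + [False]):
        if f and start is None:
            start = i
        elif not f and start is not None:
            centers.append((start + i - 1) // 2)
            start = None
    return centers


def find_cells(img, k):
    """Return cell boxes (top, left, bottom, right), end-exclusive,
    located between detected horizontal and vertical rules."""
    h_lines = band_centers(any(r) for r in open_rows(img, k))
    # Opening the transposed image finds long vertical runs.
    v_lines = band_centers(any(r) for r in open_rows(transpose(img), k))
    return [(t + 1, l + 1, b, r)
            for t, b in zip(h_lines, h_lines[1:])
            for l, r in zip(v_lines, v_lines[1:])]


# Tiny synthetic page: rules on every third row and column of a 7 x 7 grid.
page = [[1 if (y % 3 == 0 or x % 3 == 0) else 0 for x in range(7)]
        for y in range(7)]
page[1][1] = 1  # a speck of "handwriting" touching the borders
cells = find_cells(page, k=5)
print(cells)
```

The line length `k` is the main knob in this sketch: it has to be longer than any pen stroke but shorter than the shortest table rule, otherwise handwriting is mistaken for a rule or rules are lost.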

Once each cell was isolated as an image, Eivind took over. He fed the cropped cells into a specialized vision-language model, Qwen 2.5 VL, which was fine-tuned to interpret the handwriting.

“We used Unsloth, an efficient fine-tuning framework for LLMs, to train the model on this specific dataset. For instance, only certain digits and letters were valid in specific columns. We taught the model that ‘1’ always had a diagonal stroke, and ‘7’ had a crossbar – details that were crucial for accurate interpretation,” he explains.
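As an illustration of how per-column character rules might be put to work, the sketch below builds a transcription instruction for a given column and checks a transcription against that column's allowed characters. The column names, character sets, and prompt wording are all invented for the example; the article does not list the real ones, and this is not the team's Unsloth training code.

```python
# Hypothetical column rules; the real table layout is not given in the article.
ALLOWED = {
    "day": "0123456789",
    "code": "0123456789abc",
}


def build_instruction(column):
    """Compose a prompt that states the valid characters for one column."""
    return (
        "Transcribe the handwritten table cell. "
        f"Use only these characters: {ALLOWED[column]}. "
        "Return an empty string if the cell is blank."
    )


def is_valid(column, text):
    """True if every character of the transcription is allowed in this column."""
    return all(ch in ALLOWED[column] for ch in text)


print(build_instruction("day"))
print(is_valid("day", "17"))   # digits only
print(is_valid("day", "l7"))   # lowercase L is not a digit
```

A check like `is_valid` could serve both to clean training labels before fine-tuning and to flag model outputs that need a human look afterwards.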

Understanding the model’s failure points

Before fine-tuning, the dataset was manually reviewed to understand common error sources.

[Image: an example cell. We extracted text from this type of image using Qwen 2.5 VL. The cells were cut from tables like the one shown above, using image processing techniques.]

“‘1’ and ‘7’ were frequently confused, but we also found issues like noisy scans, faint handwriting, and table borders being mistaken for characters. We used this knowledge to prepare the training data and help the model learn what to expect – and what not to,” Eivind says.
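One simple way to surface confusions like these, given a set of manually labelled cells, is to count character-level substitutions between the label and the model's output. The sketch below assumes such labelled pairs exist; the sample values are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def confusion_pairs(samples):
    """Count (expected, predicted) character substitutions.

    Pairs whose lengths differ are counted separately, since a missing or
    extra character (for example a table border read as '1') cannot be
    aligned position by position.
    """
    counts = Counter()
    length_mismatches = 0
    for truth, pred in samples:
        if len(truth) != len(pred):
            length_mismatches += 1
            continue
        for t, p in zip(truth, pred):
            if t != p:
                counts[(t, p)] += 1
    return counts, length_mismatches


# Made-up (label, model output) pairs, for illustration only.
samples = [("17", "11"), ("7", "1"), ("21", "27"), ("30", "30"), ("5", "15")]
counts, length_mismatches = confusion_pairs(samples)
print(counts.most_common())
print(length_mismatches)
```

Sorting the counts puts the most frequent substitutions at the top, which indicates where extra training examples are likely to help most.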

High-precision digitization of nearly 84,000 observations

Through a combination of image analysis, annotation, fine-tuning, and validation, the team succeeded in making an almost unreadable dataset machine-readable – with high precision.
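The article does not say which metrics were used for validation. As a hedged sketch, two common choices for transcription tasks are the share of cells transcribed exactly and the character error rate (edit distance divided by the number of reference characters). The sample pairs below are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def evaluate(pairs):
    """Exact-match rate and character error rate over (label, output) pairs."""
    exact = sum(t == p for t, p in pairs)
    errors = sum(edit_distance(t, p) for t, p in pairs)
    chars = sum(len(t) for t, _ in pairs)
    return {
        "exact_match": exact / len(pairs),
        "cer": errors / chars if chars else 0.0,
    }


# Made-up held-out cells: (manual label, model output).
pairs = [("17", "17"), ("21", "27"), ("5", "15"), ("30", "30")]
metrics = evaluate(pairs)
print(metrics)
```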

“We’re making the dataset openly available, so anyone can explore how nature’s rhythms have changed over time,” says Lars.

The project illustrates how vision-language models can solve complex document challenges – the same kind we address daily in the real estate and construction sector.

For society, the project opens access to a previously untapped source of historical climate data.

“What can we learn from shifts in flowering dates or tree lines over time?” Lars asks.

“The big picture is that we can now unlock information that has been gathering dust in basements for decades – and learn from it in ways that were never possible before.”
