
We made 84,000 climate observations searchable using language models

Eivind Kjosbakken June 5, 2025
 


At Findable, we specialize in structuring and understanding documentation related to buildings. This requires deep expertise in artificial intelligence, image processing, and the handling of unstructured data. Our mission is to help clients move from chaos to control. But in this side project, we applied our skills to something completely different: a unique, handwritten phenology dataset from the early 20th century.

Historical and climate relevance

Phenology is the study of seasonal natural phenomena – when the snow melts, flowers bloom, or migratory birds return. From 1928 to 1952, Professor Henrik Printz led a nationwide effort to document these changes across Norway. The result was nearly 84,000 handwritten records of biological and climatic events, manually entered into tables by teachers and volunteers across the country. Until now, this dataset was impossible to analyze digitally.

“We’ve greatly benefited from open-source tools in our own work,” says Findable co-founder and Head of Research, Lars Aurdal. “That’s why we wanted to give something back.”

Together with Eivind Kjosbakken, Data Scientist at Findable, Lars took on this project to demonstrate how their document analysis expertise can be applied across domains.

“We chose to digitize and analyze this dataset so that researchers and others can now explore how climate change has affected seasonal patterns over the past century,” Lars explains.

Illegible tables and handwritten numbers

The scanned pages were in double-page format – often skewed and always handwritten. The handwriting was small, cells were densely packed, and many characters were difficult to read, even for humans. Making these observations machine-readable required a meticulous approach involving everything from image preprocessing to the fine-tuning of large vision-language models (VLMs).

Phenology tables

The tables in the dataset were scanned as double-page spreads, often skewed and always handwritten.

“We started with classical image preprocessing – splitting, rotating, and correcting the scans – and used morphological filters to identify individual cells,” says Lars.
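The cell-finding step Lars describes can be illustrated with a small sketch. This is not Findable's pipeline: it assumes the page has already been split, deskewed and binarized (ink = True), and it uses only NumPy. The function names and the `min_line_length` parameter are hypothetical. The idea is a binary erosion with long, thin line-shaped structuring elements: only ruled table lines survive it, while handwriting is wiped out, so the surviving rows and columns mark the cell borders.

```python
import numpy as np


def erode_1d(binary, length, axis):
    """Binary erosion with a straight line of `length` pixels along `axis`."""
    pad = length // 2
    widths = [(0, 0), (0, 0)]
    widths[axis] = (pad, length - 1 - pad)
    padded = np.pad(binary, widths, constant_values=False)
    windows = np.lib.stride_tricks.sliding_window_view(padded, length, axis=axis)
    # A pixel survives only if every pixel under the line element is ink.
    return windows.all(axis=-1)


def line_positions(mask_1d):
    """Collapse runs of consecutive True indices into one centre index per run."""
    idx = np.flatnonzero(mask_1d)
    if idx.size == 0:
        return []
    runs = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)
    return [int(run.mean()) for run in runs]


def find_cells(binary, min_line_length):
    """Return (top, left, bottom, right) boxes between detected grid lines."""
    horizontal = erode_1d(binary, min_line_length, axis=1)
    vertical = erode_1d(binary, min_line_length, axis=0)
    rows = line_positions(horizontal.any(axis=1))
    cols = line_positions(vertical.any(axis=0))
    return [
        (top, left, bottom, right)
        for top, bottom in zip(rows, rows[1:])
        for left, right in zip(cols, cols[1:])
    ]


# Synthetic 2x2 table: three horizontal and three vertical ruled lines.
page = np.zeros((21, 31), dtype=bool)
page[[0, 10, 20], :] = True
page[:, [0, 15, 30]] = True
page[4:7, 5:8] = True  # a blob of "handwriting" that must not count as a line

cells = find_cells(page, min_line_length=15)
print(cells)
```

On real scans the lines are broken and slightly tilted, so a production version would need gap tolerance and a deskewing step before this one; the sketch only shows why line-shaped filters separate grid from ink.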

Once each cell was isolated as an image, Eivind took over. He fed the cropped cells into a specialized vision-language model, Qwen 2.5 VL, which was fine-tuned to interpret the handwriting.

“We used Unsloth, an efficient fine-tuning framework for LLMs, to train the model on this specific dataset. For instance, only certain digits and letters were valid in specific columns. We taught the model that ‘1’ always had a diagonal stroke, and ‘7’ had a crossbar – details that were crucial for accurate interpretation,” he explains.
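The column rule Eivind mentions, that only certain digits and letters are valid in a given column, can also serve as a cheap check on the model's output. The sketch below is an illustration under assumptions, not the team's code: the column names and allowed character sets are hypothetical, since the article does not list the actual rules.

```python
# Hypothetical column rules; the real dataset's columns and alphabets may differ.
ALLOWED = {
    "day": set("0123456789"),
    "month": set("0123456789"),
    "code": set("abcdefghijklmnopqrstuvwxyz"),
}


def is_valid(column, text):
    """True if every character of `text` is allowed in `column`.

    An empty string counts as valid here (a blank cell).
    """
    allowed = ALLOWED[column]
    return all(ch in allowed for ch in text)


def flag_for_review(readings):
    """Return the (column, text) readings that break their column's rule."""
    return [(column, text) for column, text in readings if not is_valid(column, text)]


readings = [("day", "17"), ("day", "l7"), ("code", "x"), ("month", "")]
flagged = flag_for_review(readings)
print(flagged)
```

A reading such as "l7" in a digits-only column is flagged rather than silently corrected, which keeps a human in the loop for the ambiguous cases.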

Understanding the model’s failure points

Before fine-tuning, the dataset was manually reviewed to understand common error sources.

Phenology data
We extracted text from this type of image using Qwen 2.5 VL. The cells were cropped from tables like the one shown above using image processing techniques.

“‘1’ and ‘7’ were frequently confused, but we also found issues like noisy scans, faint handwriting, and table borders being mistaken for characters. We used this knowledge to prepare the training data and help the model learn what to expect – and what not to,” Eivind says.
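One simple way to surface confusions like the '1'/'7' mix-up is to count character-level disagreements between manually reviewed labels and model output. The sketch below assumes a list of (true label, prediction) pairs, and the sample data is invented for illustration. It is not a description of how the team ran their review.

```python
from collections import Counter


def char_confusions(pairs):
    """Count (true_char, predicted_char) mismatches over equal-length pairs.

    Pairs whose lengths differ are tallied under a separate key, because a
    position-by-position comparison is not meaningful for them.
    """
    counts = Counter()
    for truth, pred in pairs:
        if len(truth) != len(pred):
            counts[("<length>", "<length>")] += 1
            continue
        for t, p in zip(truth, pred):
            if t != p:
                counts[(t, p)] += 1
    return counts


# Invented sample of (manually reviewed label, model prediction) pairs.
pairs = [("17", "77"), ("71", "11"), ("12", "12"), ("7", "1"), ("3", "")]
confusions = char_confusions(pairs)
print(confusions.most_common(3))
```

Sorting by count shows which character pairs deserve extra training examples, and the length-mismatch bucket hints at problems such as table borders being read as extra characters.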

High-precision digitization of nearly 84,000 observations

Through a combination of image analysis, annotation, fine-tuning, and validation, the team succeeded in making an almost unreadable dataset machine-readable – with high precision.

“We’re making the dataset openly available, so anyone can explore how nature’s rhythms have changed over time,” says Lars.

The project illustrates how vision-language models can solve complex document challenges – the same kind we address daily in the real estate and construction sector.

For society, the project opens access to a previously untapped source of historical climate data.

“What can we learn from shifts in flowering dates or tree lines over time?” Lars asks.

“The big picture is that we can now unlock information that has been gathering dust in basements for decades – and learn from it in ways that were never possible before.”

 
