One of the challenges at Agtuary is not a lack of data; it is its abundance. There are many rich, publicly available datasets the agricultural industry can use to make better decisions, including rainfall and temperature observations, satellite imagery, and national soil models, not to mention the vast amounts of data also collected on-farm. With 20 or more years of history, this quickly adds up to many terabytes of information at a national scale, leading to a big-data problem. How do we transform all of this data into a usable format for our users? Why does my processing algorithm crash? And why is this taking so long?
A large proportion of the data we use is geolocated. This geospatial aspect introduces a second problem: how do we extract the data relevant to a particular farm or location? There are many GIS (geographic information system) techniques that can help us query the data based on, for instance, publicly available property boundaries.
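In practice these queries are done with GIS libraries such as Shapely, GeoPandas, or PostGIS, but the core operation, testing whether an observation falls inside a property boundary, can be sketched in plain Python with the ray-casting algorithm. The boundary coordinates below are invented for illustration:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: does point (x, y) fall inside the polygon?

    `polygon` is a list of (x, y) vertices; edges wrap around to close
    the ring.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Toggle on every polygon edge that a rightward horizontal ray
        # from the point would cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# A hypothetical rectangular property boundary in (lon, lat) order.
paddock = [(150.0, -33.0), (150.1, -33.0), (150.1, -33.1), (150.0, -33.1)]
print(point_in_polygon(150.05, -33.05, paddock))  # True  (inside)
print(point_in_polygon(150.20, -33.05, paddock))  # False (outside)
```

Real property boundaries have many more vertices and need an index (an R-tree) to stay fast at national scale, which is exactly what the GIS libraries above provide.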
This article will discuss some of the technologies and techniques we use to overcome these big data and geospatial challenges.
Our geospatial-big-data problem essentially boils down to: "damn, we have a lot of large raster files...". The go-to for working with rasterised data in Python is the NumPy array. With very large rasters, these array objects can exhaust memory, making development annoying and slow, as it is hard to debug and pinpoint where the problem occurred.
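To see why memory becomes a problem, it helps to estimate an array's footprint before allocating it. The grid dimensions below are invented for illustration, not our actual data:

```python
import numpy as np

def raster_footprint_gib(rows, cols, bands, dtype=np.float32):
    """Estimate the in-memory size of a dense raster array in GiB."""
    itemsize = np.dtype(dtype).itemsize  # bytes per cell
    return rows * cols * bands * itemsize / 1024**3

# A hypothetical 10 m national grid: 400,000 x 400,000 cells, one band.
size = raster_footprint_gib(400_000, 400_000, 1)
print(f"{size:.0f} GiB")  # ~596 GiB -- far more than any workstation holds
```

Multiply that by a 20-year time series and it is obvious why loading everything into one NumPy array gets the process "Killed".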
We recently encountered one of these problems while applying our growth classification models (our secret curves and spices) to high-resolution time-series NDVI rasters (see Matt's article). Because this needs to run at a national level, we wanted to be able to do R&D locally (on a developer's computer) and still scale the processing up. During development, we would often see the "Killed" message that appears when the operating system decides there is insufficient memory and terminates the Python process.
We use "chunking" to overcome these memory problems. Chunking divides the data arrays into smaller, bite-sized, memory-friendly pieces. We also use array objects that carry the correct georeferencing for each chunk being processed. Because the data is divided into chunks, we can process it in parallel across many machines (called workers). Once processed, the data can be saved in the same chunked format, so other tasks can easily operate on the results in the same distributed way. Finally, we can distil the data via distributed statistical techniques into a meaningful, memory-friendly format served via an API. This API allows other team members to develop impactful features for our users without worrying about these big-data problems.
These techniques are great, but it's our team dynamic that really allows us to overcome these challenges: we have the blend of skills needed to be effective geospatial-big-data problem solvers.
I really appreciate working with our talented team at Agtuary, which brings together the right mix of software engineering, data science, agronomic, and mathematical skills needed to solve these challenges. The technology we are building makes dealing with common big-data problems in ag-tech efficient and robust. In the next instalment, I'll share the pain of automating the processes above.
Agtuary provides ag-tech solutions such as analytics APIs, website integrations, and property assessment tools. If your organisation requires our existing or custom solutions, please reach out to us at email@example.com to arrange a demo.