Geospatial machine learning software provider Picterra has launched a new data curation and exploration technology that allows users to get a better understanding of their datasets and improve model accuracy.
According to the company, this industry-first innovation enables organizations and Artificial Intelligence (AI) teams to get automated insights into their datasets and build more robust models with lower annotation costs.
The company said that visualizing data is the first step in any Machine Learning (ML) workflow and can often be challenging to perform when working with large and complex aerial imagery on a global scale.
Picterra’s Data Exploration Report helps users reveal visual patterns in their data and provides key insights for building better, more robust detectors.
“Dataset exploration is a game changer for Picterra users,” said Julien Rebetez, Chief Technology Officer at Picterra. “It’s the first in a series of advanced data curation tools that will enable users to effortlessly take the performance of their detectors to the next level.”
Accessible alongside the training report, the Data Exploration Report allows a quick assessment of training coverage and identifies areas where the user should concentrate in future iterations. It supports three main tasks:
- Improve dataset quality: ensure the data covers the variety of appearances an object will have in production (e.g., ‘building on grass’, ‘building on snow’, etc.). Better datasets lead to better models.
- Ensure the validation set is representative: when the validation set covers the variety of the dataset, the validation score better reflects how well the model will perform in production on new data.
- Curate data: distribute and focus annotation effort on the dataset’s most impactful images and regions.
The features are based on unsupervised learning and clustering techniques and allow a user to evaluate the distribution of their dataset. This is important because it allows users to spot ‘annotation gaps’ in their datasets.
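As a rough illustration of what spotting an annotation gap could look like, the sketch below computes, for each visual group, the share of tiles already covered by training areas and flags groups below a threshold. The function name, the group labels, and the `min_share` threshold are all hypothetical; this is not Picterra's implementation.

```python
from collections import Counter

def find_annotation_gaps(tile_groups, annotated_tiles, min_share=0.1):
    """Return groups whose annotated share of tiles falls below min_share.

    tile_groups: dict mapping a tile id to its visual group label.
    annotated_tiles: set of tile ids already covered by training areas.
    """
    totals = Counter(tile_groups.values())
    covered = Counter(tile_groups[t] for t in annotated_tiles)
    shares = {group: covered[group] / totals[group] for group in totals}
    return {group: share for group, share in shares.items() if share < min_share}

# Hypothetical data: group labels are made up for illustration.
tile_groups = {0: "forest", 1: "forest", 2: "urban", 3: "urban", 4: "water", 5: "water"}
annotated_tiles = {0, 2, 3}
gaps = find_annotation_gaps(tile_groups, annotated_tiles)
print(gaps)  # only "water" has no annotated tiles
```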
The report divides large imagery into small tiles, then groups the tiles by visual similarity (e.g., forest, water, urban, etc.). These tiles are then visualized within the interactive report, allowing users to see which regions are covered by the current training dataset and make adjustments where necessary.
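The tile-then-group idea can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it describes each tile by the mean and spread of its pixel values and groups tiles with a plain k-means. The article does not say which features or clustering method Picterra uses, and a production system would more likely rely on learned image embeddings. All function names here are hypothetical.

```python
from statistics import mean, pstdev

def split_into_tiles(image, tile_size):
    """Yield (tile_row, tile_col, tile) for each full square tile."""
    height, width = len(image), len(image[0])
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size] for row in image[top:top + tile_size]]
            yield top // tile_size, left // tile_size, tile

def tile_features(tile):
    """Describe a tile by the mean and spread of its pixel values (an assumption)."""
    pixels = [p for row in tile for p in row]
    return (mean(pixels), pstdev(pixels))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with farthest-point initialization; returns one label per point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels

# Synthetic 8x8 grayscale image: dark left half, bright right half.
image = [[10] * 4 + [200] * 4 for _ in range(8)]
tiles = list(split_into_tiles(image, tile_size=4))
labels = kmeans([tile_features(tile) for _, _, tile in tiles], k=2)
for (row, col, _), label in zip(tiles, labels):
    print(f"tile ({row}, {col}) -> cluster {label}")
```

On this toy image the two dark tiles land in one cluster and the two bright tiles in the other, which is the kind of grouping an interactive report could then display.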
Dataset exploration can also be used for ‘data curation’ approaches, where a team of annotators needs to be assigned images to annotate. Selecting the regions to annotate with the Data Exploration Report distributes the annotation workforce efficiently: annotators work on regions that maximize the diversity of appearance covered by the dataset, which results in more robust detectors.
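One simple way to turn that idea into an assignment rule is a greedy pass that always picks the next tile from the visual group with the least annotation so far, then hands tiles to annotators in turn. The sketch below assumes each tile already has a group label; the function, the annotator names, and the `budget` parameter are hypothetical and only illustrate the principle.

```python
from collections import defaultdict
from itertools import cycle

def assign_tiles(tile_groups, annotated_tiles, annotators, budget):
    """Pick up to `budget` unannotated tiles, least-covered groups first,
    and hand them out to annotators in turn."""
    covered = defaultdict(int)      # annotated tile count per group
    candidates = defaultdict(list)  # unannotated tile ids per group
    for tile, group in sorted(tile_groups.items()):
        if tile in annotated_tiles:
            covered[group] += 1
        else:
            candidates[group].append(tile)

    assignments = {name: [] for name in annotators}
    turn = cycle(annotators)
    for _ in range(budget):
        open_groups = [g for g in candidates if candidates[g]]
        if not open_groups:
            break
        # Least-covered group first; ties are broken by group name.
        group = min(open_groups, key=lambda g: (covered[g], g))
        tile = candidates[group].pop(0)
        covered[group] += 1
        assignments[next(turn)].append(tile)
    return assignments

# Hypothetical data for illustration.
tile_groups = {0: "forest", 1: "forest", 2: "urban", 3: "urban", 4: "water", 5: "water"}
assignments = assign_tiles(tile_groups, {0, 2, 3}, ["ana", "ben"], budget=3)
print(assignments)
```

The greedy rule favors groups with no annotations yet, so new effort goes first to appearances the dataset has not seen.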
The following client example, using satellite imagery from Morocco, shows how the Data Exploration Report can be used to solve real-world problems. The goal of the detector, in this case, was to identify man-made holes used for reforestation, a natural way to preserve and strengthen biodiversity and to combat climate change.
Following the initial detector training, the Data Exploration Report identified missing training coverage: regions where the detector had not been shown what the holes do not look like. Adding empty training areas within the identified regions reduces the risk of a high rate of false positive detections when the detector is run at scale. A similar process can also be used to improve accuracy area coverage.