SDG 13: Climate Action
2. Project Details
Company or Institution
SpaceML Worldview Search – The NoCode Earth Data Curator from Unlabeled Petabyte Scale Imagery
General description of the AI solution
One of the initial and crucial steps of a scientific study related to climate change and natural disasters, including wildfires, oil spills, hurricanes, dust storms, etc., involves scientists gathering a large number of relevant examples. Locating these examples requires painstakingly inspecting 197 million square miles of satellite imagery each day across more than 20 years. While such an effort can produce a valuable trove of data, the act of manually searching is laborious, expensive, and often impractical – grounding many scientific studies before they could ever take off.
This project aims to let a scientist use a single query image of a climate event to ultimately build a curated dataset containing an exhaustive collection of data of the same type of event.
The scientist first queries with the initial image to retrieve a set of candidate matches. The system lets the scientist pick the relevant examples, and then retrieves the examples the classifier is most uncertain about for further labeling. This can repeat for multiple rounds until the background classifier performs well, at which point it goes over the entire dataset and recommends the examples it is confident about, thereby building an exhaustive dataset with significantly less effort. The aim of this open source project is to enable a scientist to accomplish all of this without having to write a single line of code or possess any prerequisite knowledge of AI. This further reduces barriers to entry and enables researchers across all disciplines to benefit from the project.
Developed for scale and cost-effectiveness, this first-of-its-kind volunteer open science project has been built by an inclusive team of 28 citizen scientists from 7 countries, ranging from high school to undergraduate students, as well as English teachers transitioning into data science. Industry mentors from companies like Pinterest, NVIDIA, Netflix, and Twitter guide students pro bono, from getting started in AI to software development to writing publications.
Excellence and Scientific Quality: Please detail the improvements made by the nominee or the nominees’ team or yourself if you're applying for the award, and why they have been a success.
Similarity Search is a known application of AI. But there are several challenges in making it work for remote sensing:
– unlabeled data making a supervised network untrainable
– failure of ImageNet pre-trained networks to generalize to multi-band data
– different physical sizes of phenomena (ranging from 500 km hurricanes to 1 km wildfires)
– vast imbalances in concepts – 70% being oceans while 0.00001% being hurricanes
(1) SSL: We use advances in self-supervised learning (SimSiam, SimCLR) to train on unlabeled data, tuned to multiband data.
(2) Imbalance: To train on imbalanced data without labels, the system trains an SSL model, generates embeddings, and uses a CoreSet sampling approach to pick points in the embedding space (effectively equidistant embeddings), selecting more diverse, balanced clusters. It does this iteratively, yielding a much higher-performing model.
(3) Similarity Search: The system then uses similarity search to show researchers a few similar examples, from which they build a seed dataset of positive and negative images.
(4) Active Learning: Using this training set, an active labeling approach finds the examples that the classifier is most uncertain about. The scientist labels them and the classifier is retrained, iteratively leading to a higher-performing classifier. Eventually, this classifier is used to identify and label images with a high probability of being of interest to the scientist, vastly reducing the effort.
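Steps (3) and (4) can be illustrated with a minimal sketch. It is not the project's implementation: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and uncertainty is assumed to mean a binary classifier's predicted probability lying close to 0.5.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def most_similar(query, pool, k):
    """Return indices of the k pool embeddings closest to the query (cosine)."""
    sims = normalize(pool) @ normalize(query[None, :])[0]
    return np.argsort(-sims)[:k]  # highest similarity first

def most_uncertain(probs, k):
    """Return indices of the k items whose predicted probability is nearest 0.5."""
    return np.argsort(np.abs(probs - 0.5))[:k]
```

In a labeling round under these assumptions, `most_similar` would supply the seed candidates shown to the scientist, and `most_uncertain` would pick which images to show next once a classifier has produced probabilities.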
To scale up cheaply in the cloud: (1) embeddings for the full pool of data are generated once; (2) a smaller representative sample is generated using CoreSet; (3) a locally trained classifier is run on this representative pool (in minutes); (4) for any confusing points identified, a similarity search for neighbors retrieves more confusing examples in the neighborhood. Considering each forward pass on a 10B collection would otherwise cost $30K/21K hours, this approach allows “Access to Billions, cost in Pennies”.
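Step (4) of this scaling recipe, expanding from confusing points to their neighbors in the full pool, might look roughly like the sketch below. The function name and the brute-force Euclidean distance are assumptions for illustration. At the scale described, a precomputed approximate nearest-neighbor index would presumably replace the brute-force distance matrix.

```python
import numpy as np

def expand_confusing(pool_emb, confusing_emb, k):
    """For each confusing embedding, collect its k nearest pool neighbors."""
    # Pairwise squared Euclidean distances, shape (n_confusing, n_pool)
    diff = confusing_emb[:, None, :] - pool_emb[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    # k closest pool indices per confusing point
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Merge into one sorted, de-duplicated set of candidates to label next
    return np.unique(nearest)
```

Because the pool embeddings are generated once, only this lookup has to be repeated per round, not a forward pass over the whole collection.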
Full technical video presentation explaining the project in an accessible manner:
Scaling of impact to SDGs: Please detail how many citizens/communities and/or researchers/businesses this has had or can have a positive impact on, including particular groups where applicable and to what extent.
While NASA satellites have collected 40 PB of data, a volume expected to grow to 250 PB by 2025, the full potential of this data is locked because it is unlabeled. This project opens the floodgates to research on this unlabeled data by allowing scientists to curate datasets specific to their topics of interest, hence acting as an impact multiplier for SDG 13.
The uniqueness of this project is that, rather than focusing on a single area such as floods, wildfires, or oil spills, it enables experts working on a range of problems to get the datasets they need quickly and at low cost. Since data is fundamentally the most important ingredient for starting any scientific study, the project adds global value by widening the funnel of possible attempts to study climate change, and hence hopes to deliver a multiplicative impact in several areas as adoption grows.
NASA measures impact on a Technology Readiness Level (TRL) scale of 1-9 (with 9 considered a flight-proven system deployed on a successful mission). While most research papers end at TRL 3, this project is currently at TRL 6 and aiming to reach TRL 7 by Nov 2021.
NASA has done a feature story detailing this impact:
Scaling of AI solution: Please detail what proof of concept or implementations can you show now in terms of its efficacy and how the solution can be scaled to provide a global impact and how realistic that scaling is.
* “During a recent demonstration of the GIBS/Worldview imagery pipeline, a machine was trained to search for islands through five million tiles of Earth imagery starting with a single seed image of an island. Approximately 1,000 islands were identified in just 52 minutes. If done manually, this effort would take an estimated 7,000 hours (assuming five seconds to evaluate and label each image tile) and potentially cost as much as $105,000 (assuming $15 per hour).” – Nasa.gov
* SpaceML high school students won a NASA Science Mission Directorate grant on Groundbreaking Science, placing among the top 5 proposals from 79 research groups.
* In a first of its kind, a student team was invited to deliver a talk on its innovations at NASA headquarters, twice, to an audience including program directors of divisions beyond earth science and the Chief Science Data Officer, showing the interdisciplinary potential.
* NASA IMPACT Team scientists are currently using SpaceML tools in their daily workflows. Example tools publicly visible include GIBS Downloader, Self-Supervised Trainer, and Swipe Labeler.
Over 15 terabytes of data have already been downloaded with GIBS Downloader.
* For wider public use, the NASA IMPACT Team is currently incorporating the SpaceML pipeline into an upcoming phenomena portal to raise public awareness (releasing Nov 2021).
* NASA measures impact on a Technology Readiness Level (TRL) scale of 1-9 (with 9 considered a flight-proven system deployed on a successful mission). While most research papers end at TRL 3, this project is currently at TRL 6 and aiming to reach TRL 7 by Nov 2021.
* The modular pipeline has opened up funded research on other interdisciplinary problems, such as for a team working on the Hubble Space Telescope.
* Jeffrey Smith from SETI was able to quickly assess the value for planetary data (from the Voyager and Cassini spacecraft) and build a 5-year mission proposal for planetary search.
* 6 accepted talks from high school students at COSPAR 2021 Conference, workshop on Machine Learning for Space Sciences.
* During a demonstration call with NASA scientists to show the power of collaborative labeling, ~500 items were labeled in 73 seconds using the GUI-based Swipe Labeler ( https://github.com/spaceml-org/Swipe-Labeler )
* Contributors invited to speak at UN’s Third ITU/WMO/UNEP Workshop on Artificial Intelligence for Natural Disaster Management
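The manual-effort estimate in the Nasa.gov quote at the top of this list can be checked with a few lines of arithmetic. The variable names are illustrative. The inputs are the figures stated in the quote, and the speedup ratio is derived here rather than taken from the source.

```python
# Figures stated in the Nasa.gov quote
tiles = 5_000_000          # Earth imagery tiles searched
seconds_per_tile = 5       # assumed manual time to evaluate and label one tile
hourly_rate_usd = 15       # assumed labor cost per hour
machine_minutes = 52       # reported time for the automated search

manual_hours = tiles * seconds_per_tile / 3600
manual_cost_usd = manual_hours * hourly_rate_usd
speedup = manual_hours * 60 / machine_minutes

print(f"manual effort: {manual_hours:,.0f} hours")
print(f"manual cost:   ${manual_cost_usd:,.0f}")
print(f"speedup:       {speedup:,.0f}x")
```

The unrounded results are about 6,944 hours and $104,167, consistent with the quoted "estimated 7,000 hours" and "as much as $105,000" (7,000 hours at $15 per hour).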
Ethical aspect: Please detail the way the solution addresses any of the main ethical aspects, including trustworthiness, bias, gender issues, etc.
The project addresses inclusion, diversity, and ethics in several ways, from the user, contributor, and technology angles:
(1) Inclusive to the scientific user audience – The project aims to lower the barrier to entry by making tools runnable with a single-line command, without researchers needing AI knowledge or even programming experience. And with some tools having a graphical user interface (like a Chrome extension to search for examples of natural disasters), users beyond researchers can be inspired to understand the effects of climate change.
(2) Inclusive to budget – The project aims to significantly lower the cost of conducting research studies, so researchers don’t have to file and wait for large grants. Firstly, the project helps in quickly finding the data needed to conduct the study. Secondly, high-performance modules maximize the utilization of the available hardware (from free Colab notebooks to multi-node clusters) without any user knowledge of AI. This type of tuning usually requires deep AI practitioner experience. This is part of the reason the entire project could be built, and its experiments run, at almost no cost using free Colab notebooks.
(3) Inclusive to contributor background, changing career direction – Most opportunities to work directly with NASA are very selective and often available only to researchers with advanced educational backgrounds, with the most common starting positions being postdoctoral positions. And while the funnel of students in advanced STEM fields is already narrow, it's even narrower for women and people of color. SpaceML helps connect aspiring changemakers with the opportunity to make an outsized impact. It does that by inspiring them through talks, training them, opening opportunities to conduct research in a state-of-the-art field, guiding them through generating publications and releasing free open-source tools, and then giving them the stage to showcase their work in front of NASA scientists, significantly accelerating the production of usable research for NASA and its adoption by scientists. And it does so inclusively, irrespective of academic background. The 28 contributors in 1 year (starting Aug 2020) come from India, Mexico, South Korea, Canada, Germany, the UK, and the USA, and most have a high school or undergraduate background (split equally). On the strength of the results, students have been receiving open offers for future NASA internships. Two contributors are English teachers who, motivated by climate change, made a career transition into data science, with one having landed a full-time technical role. More importantly, with stories of relatable young changemakers, more people will follow in using their talents for social good.
(4) Reducing bias algorithmically: Earth imagery is highly imbalanced, with ~70% being oceans while less than 0.00001% being hurricanes, illustrating an extreme case of bias – to the extent of making many models untrainable for any robust usage. We devised a technique of diverse data sampling (using CoreSet sampling from embeddings) to pick the most unique data samples iteratively while training smaller models and then training a final model with the diverse data. This reduction in data bias led to several classifiers moving from 35% accuracy to 89%, making previously untrainable models trainable.
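The diverse-sampling idea in (4) can be sketched as follows. The text names CoreSet sampling without spelling out the algorithm. Greedy k-center selection, a common formulation, is assumed here, and the function name and the Euclidean distance are illustrative choices rather than the project's actual code.

```python
import numpy as np

def k_center_greedy(emb, n_select, first=0):
    """Pick n_select row indices that spread out over the embedding space."""
    selected = [first]
    # Distance from every point to its closest already-selected point
    min_dist = np.linalg.norm(emb - emb[first], axis=1)
    while len(selected) < n_select:
        # The point farthest from the current selection is the most novel one
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(emb - emb[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return selected
```

On a pool dominated by near-identical embeddings (the "ocean" case), this selection reaches the few distinct points before it picks a second copy of the dominant cluster. That is the sense in which it reduces imbalance without needing labels.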