The Future of Data for AI: Building Inclusive Foundations Through Open Collaboration


cauri jaye, Jul 20, 2023

As increasingly advanced large language models like Llama2, Gemini, and GPT-5 emerge, we have an unprecedented opportunity to shape their evolution by contributing diverse open-source data. These models form the foundation on which later models will be built, so their influence will carry forward for generations. By feeding them inclusive, ethically sourced data from across cultures, we can imbue AI with greater empathy, nuance, and humanity from the start.

Synthetic and Sensor Data: Teaching AI to Understand Context

We can enhance these models in innovative ways through synthetic and sensor data. Synthetic data, algorithmically generated rather than collected from the real world, allows us to customise datasets. We can fill gaps, simulate rare events, and represent diverse demographics to boost AI's adaptability and robustness.

Sensor data from cameras, microphones, and other devices acts as a set of surrogate senses, enabling AI models to perceive the world more like humans do. Visual data is like eyes, letting AI see. Auditory data is like ears, letting AI hear. This sensory influx, converted into text, brings a nuanced understanding of physical environments to LLMs.

With vision and hearing, humans can rapidly learn, predict situational outcomes, and categorise memories. Similarly, sensor data gives AI the perceptual foundation to start recognising patterns, understanding contexts, and categorising complex concepts.

Essentially, sensor feeds will accelerate AI's understanding of the complexity of human societies, human environments, and the intricate values by which they operate. Bringing AI closer to our contextual comprehension ultimately brings it closer to exhibiting empathy and human understanding. Combined, synthetic and sensor data will help AI grasp nuance and complexity.

Embracing Diversity: Inclusive Data for Inclusive AI

But we can't stop there. To build truly inclusive AI, we need diverse data capturing the breadth of cultural contexts across the globe. Training AI exclusively on localised data from a single culture risks baking in biases and blind spots. Relying solely on Western or Chinese data silos limits AI's global comprehension.

We need to actively incorporate data sourced inclusively from other cultures. Literature, news, entertainment, and other content from Africa, Asia, South America, and more will allow AI to reflect the richness of human experience worldwide. This data diversity helps AI represent and resonate with all people, regardless of background.

Building a collective data commons requires open collaboration across borders and institutions. By transparently sharing knowledge and AI resources, we can bridge gaps between siloed cultural data pools. Our contributions have far-reaching influence - these first generations of large language models are the primordial soup from which all future AI will emerge.

Feeding AI Our Creativity to Cultivate Its Humanity

We have a special opportunity to directly shape AI by contributing our own novels, poems, art, music, and more. These creative works offer a powerful channel for transmitting humanistic sensibilities into AI.

The news is filled with lawsuits and disputes over the use of copyrighted material to train these first large language models. However, creatives may not realise how significant contributing to these AI foundations could be for humanity as a whole.

Fiction exploring moral dilemmas, songs conveying emotion, art celebrating cultures - our creativity nourishes AI with humanity's richness. As we feed our books, videos, and other original works into models, our diverse ideas help cultivate AI's understanding of ethics, humour, empathy, values, and nuance.

Our creative contributions have an enduring influence. They ripple through future AI, helping teach subtleties like emotional intelligence, cultural appreciation, and moral reasoning. The more creativity we contribute, the more AI's predictions will integrate ethics, grasp humour, convey emotion, and align with human values.

By generously sharing our art, writing, and music with these still-learning systems, we directly shape AI's evolution toward enlightened perspectives. Our creative works are gifts introducing AI to the deeper meanings of the human experience. This is how we author AI's future - not through coding, but by feeding it our humanity.

The Power of Open Source

To fully diversify AI's data diet, we need open-source models that can freely ingest broad contributions. Open ecosystems, such as open-source software, can adapt faster by allowing decentralised collaboration from diverse contributors worldwide.

Similarly, open-source large language models are rapidly evolving through decentralised data donations from various cultures. Siloed corporate models and open-source models alike could incorporate rich inputs from worldwide sources.

Just as open-source code benefits from global software engineers selflessly contributing, open AI models can grow through diverse cultural data contributions. The more openness, the more rapidly these communal models can gain global understanding beyond what any single entity could achieve. By embracing open collaboration, we can nourish AI's inclusive understanding and jointly elevate its shared intelligence.

Rethinking Expectations: AI Limited by Imperfect Data

At the same time, we must reevaluate our expectations of AI's capabilities. As models become more complex, we often expect them to surpass human intelligence. But AI is still learning from our imperfect, often messy data. We've essentially asked AI to find a signal in vast amounts of noisy data generated by flawed human minds like ours.

While exhilarating progress is on the horizon, AI remains constrained by the data we feed it. But with diverse, ethically sourced data and continuous tuning, we can work toward AI that overcomes our biases to better represent the breadth of human experience.

By embracing open collaboration to diversify data sources for AI, we can build more inclusive, empathetic, and holistically beneficial models. This is a pivotal moment to contribute our creativity to shape the future trajectory of AI in a positively human direction.