AIDC - Monetize your datasets
Monetize your datasets. Build a side hustle as a data engineer.
Data engineers and data scientists are the architects of the future of AI. They build the pipelines, the processing, the cleaning, and the structure that feed the models delivering the usable intelligence we all rely on. Now, with AIDC, you can monetize that work from engagement to engagement, getting its maximum value and accelerating the pace of training and AI optimization for custom solutions. We make it easy to post and trade AI training datasets, with full compliance and contracts generated based on your criteria.
We will help you create your market account, get you set up with hosting if you need it, and upload your datasets with a drag-and-drop interface. When a purchase is made and the contract is executed, payment is immediate. With our link infrastructure, there is no limit to dataset size. You can sell a dataset one time, or you can sell subscriptions if you update the data periodically.
The AI industry is experiencing an insatiable demand for high-quality, ethically sourced, clearly contracted, and governed datasets. The big AI models have been trained on Common Crawl and other publicly available data, which amounts to 3-plus petabytes. But the rest, over 100 petabytes, is ready to be harvested and structured by data engineers like you. AIDC gives you a marketplace where you can collaborate with others, learn, and continue to deliver more data to power our AI revolution.
Come join us.
dpd
Gen AI Copyright Act
Generative AI Copyright Act.
The “Generative AI Copyright Disclosure Act of 2024” was introduced in the U.S. House of Representatives on April 9, 2024, to address the issue of training generative AI systems on copyrighted material. The bill was proposed by California Congressman Adam Schiff. As of this writing it has no registered number, and its current draft is very short; it will be reviewed and developed as it goes through committee. Here is a link to the bill. I suggest you read it: it is 5 pages, double spaced, and takes only a minute.
In its current form, this bill supports the very thesis of AI Data CO-OP, both the foundation and the operating company. The proposed bill states that any dataset to be used for AI training must post “a sufficiently detailed summary of any copyrighted works used.” The bill goes on to state that this applies to both the base dataset and any altered dataset. It also requires a URL for the dataset if it is publicly available, like this one: https://huggingface.co/datasets/the_pile_books3 , which is the dataset that got everyone in trouble in the first place.
The bill goes on to state that all datasets used to train a generative AI system must report their copyrighted contents to a central repository, held at the Copyright Office, no less than 30 days prior to the public availability of the generative AI system that the datasets were used to train. And here lies the challenge. The more data you have, the better your generative AI solutions are, and thus we must be able to accurately track the use of both copyrighted and licensed data, not only for legal purposes but also to properly compensate those who helped make the quality data that these systems need to be useful. Fortunately, here at AI Data CO-OP, on both the foundation side and the operating side, we are building exactly that: working with the Copyright Office on improved standards and tracking, and building the operating infrastructure to help accelerate innovation and keep the whole system moving forward.
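To make the timing requirement concrete, here is a minimal Python sketch of a disclosure record and the 30-day check described above. The class, field, and function names are hypothetical and chosen only for illustration; they do not come from the bill, from the Copyright Office, or from any AI Data CO-OP system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Lead time taken from the post: notice is due no less than 30 days
# before the generative AI system becomes publicly available.
NOTICE_LEAD_DAYS = 30


@dataclass
class DatasetDisclosure:
    """Hypothetical record of what a disclosure notice might carry."""
    dataset_name: str
    copyrighted_works_summary: str
    public_url: Optional[str] = None  # only for publicly available datasets


def latest_filing_date(release_date: date) -> date:
    """Last day a notice can be filed for a given public release date."""
    return release_date - timedelta(days=NOTICE_LEAD_DAYS)


def is_filed_in_time(filed_on: date, release_date: date) -> bool:
    """True if the notice was filed at least 30 days before release."""
    return filed_on <= latest_filing_date(release_date)
```

Under these assumptions, a system released on July 1, 2024 would need its notice filed by June 1, 2024; a notice filed on June 2 would be one day late.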
dpd
Good Data Matters
Good data is essential for AI training.
On November 14th, two days before we launched the AI Data CO-OP initiative, Google DeepMind released GraphCast, their graph neural network (GNN) model for weather forecasting. It is yet another elegant application of deep learning by DeepMind, and it deserves attention here because it is a great example of the impact of good data on AI training. DeepMind’s creative work with the GNN is significant, and you can see the models and even run them on the DeepMind website, but let’s talk about the data.
Google leveraged 39 years of archived data, from 1979 to 2017, from the European Centre for Medium-Range Weather Forecasts (ECMWF) to train the GNN. This training took 4 weeks running on 32 Cloud TPU v4 devices in parallel. You may have read in the press that you can run the model in a few minutes on a laptop; that refers to the pretrained model, which you feed updated information in the form of two additional weather inputs. The trained model was then tested with those two weather inputs against test data from 2018 through the present day, and its accuracy results beat current weather prediction methods.
The ECMWF is an independent organization that works with 35 European countries and employs 450 staff to collect and administer weather data and provide meteorological services to these members. The organization was established in 1975 and began collecting and curating data and sharing it with its partners. When they started, I am sure they had no expectation of what Google and DeepMind would be doing with their data in 2023, but we are blessed by their discipline and their diligence. What other areas are we missing for the future?
dpd
The AI Data CO-OP
Datasets created and curated as an asset, and monetized by AI Data CO-OP for those who manage the data. Data builds the future.
Data is important; it always has been. But now it is more important than ever. Well-managed and curated data, with provenance and history, is even more valuable. This data can be used for internal analysis, machine learning, AI and deep learning. But the truth is, data like this is rare, very rare. Even with all the system controls that we all use, it takes time, effort, oversight and management to keep things at a reasonable level of usability. Well-managed data is not free, and those who manage and curate it should be recognized and, if appropriate, compensated.
The last decade has brought together a confluence of technologies that enable us to create intelligence from data of different types. With cheaper compute, cloud storage and open source AI algorithms, a good share of future software solutions will be generated from data. This data will come from within the implemented systems and will often be merged with data from other sources. AI Data CO-OP is here to make it easier to bring this data together and solve problems in a more friction-free manner.
The domain of monetizing data sources for AI training is nascent and controversial. The goal of this non-profit initiative is to make it efficient for all parties involved. Whether the data holders evolve into aggregators or are simply operators who create massive amounts of data as part of their operations, their focus on making that data available and viable for training will be valuable. Data that is kept up to date, secure, easily accessible, well documented and curated is an asset and should be managed as such.
If you are interested in joining me on this journey, let me know.
dpd