By Theo Turner
This article summarises machine learning and its relation to data silos before introducing Outlier Ventures’ first step into apps in the Convergence Ecosystem. Read on to see how unsupervised learning can be used to optimally deploy a decentralised ride sharing fleet and how the resulting data can be monetised, all while running in the Convergence stack.
Real data is imperfect. A computer might know what a cat is from a picture of a cat, but change one hair and suddenly the computer tells you it’s not a cat. Humans are very good at grouping similar things under a single name, but computers, operating in a world of ones and zeroes, are not. You might say that it’s still a cat, but your computer will disagree, referring you to that one hair.
It’s problems like these that gave birth to the field of machine learning. In its early days, a core technique, now a staple of neural networks, emerged: classification. This technique allowed computers to be taught to distinguish between objects with some inherent variation, such as the hairs on a cat.
It works as you might expect to teach a child: you show an AI pictures of different cats, and it starts to learn the similarities while discarding the differences. You do the same with pictures of dogs, and eventually your AI has a pretty good idea of what cats and dogs are. Machine learning emerged out of a model of human learning, and the results were unprecedented. Why don’t you ask your phone to show you pictures of cats right now?
We can teach our computers. Still, they aren’t really on par with humans (yet).
So what’s the problem? The AI would never have known how to classify a cat or a dog if you didn’t tell it what it was seeing. Classification is an example of supervised learning, which means a human has to be there to teach the AI what’s what. Humans, until recently, had a leg up on AI in that they were very good at teaching themselves.
This is where clustering comes in. Perhaps the most fundamental and important technique for unsupervised learning, clustering doesn’t demand any prior knowledge of what a cat or dog might look like. Clustering means grouping objects based on patterns in data. It might not sound complicated, but for an AI, it means understanding the distinction between a cat and a dog without ever having been taught what either is. All you need is data.
Enter Ocean Protocol.
AI advancement can be sped up substantially with more data. Big companies have big data, but startups do not. Ocean is creating a global data marketplace to bridge this gap. If you have data, Ocean can help you not only to make it accessible to the world, but to monetise it too.
A problem remains: without the right tools, your data becomes hard to manage. In the early days of a decentralised world, data stores are largely unstructured and difficult to work with. If you have no means to serve your data, you can’t be a part of Ocean’s global marketplace to begin with.
Enter Haja Networks.
Haja Networks’ flagship open-source project OrbitDB gives you control of your data in a decentralised world. In short, OrbitDB is a distributed P2P database. Building on top of IPFS, OrbitDB allows anyone to store, share and manage their data without being constrained to data silos.
Clearly, there is considerable potential for synergy between Haja and Ocean as we move toward convergence. As of yet, though, there is no bridge between the two. This untapped potential inspired us to build our latest addition to Web 3.0.
Introducing Outlier Ventures’ H2O
[image id=’3348′ alignment=’left’]
H2O, short for Haja to Ocean, is the first app to run in the convergence stack, building on top of multiple software frameworks that underpin the convergence ecosystem. H2O is a demonstration of machine learning in Web3, allowing users to run a clustering algorithm on data stored in OrbitDB and publish the resulting AI datasets to the Ocean Protocol blockchain.
H2O makes use of various technologies, including Python, Node and Angular. A separate component, H2O-Host, has also been created so that developers can upload their own datasets to OrbitDB, as well as generate datasets for testing purposes.
To demonstrate how OrbitDB is a functional, ready-to-use tool in the convergence ecosystem, H2O allows the user to enter an OrbitDB address of their choosing. The app plots the dataset, with the user able to specify the number of clusters the algorithm should aim for. H2O will parse any two-dimensional set of points using K-means, a popular and efficient clustering method. The clustered output is also plotted for the user to see.
Once the algorithm has run, datasets can be published to Ocean Protocol. Publishing is done using Ocean’s Squid API for Python. H2O has built-in support for Azure hosting, the current standard for Ocean Protocol, and also provides proof-of-concept hosting using OrbitDB for developers.
H2O is able to generate useful preprocessed datasets for AI and use them to seed Ocean’s global data market, all on a decentralised storage solution made possible by Haja Networks. Though it is largely demonstrative for now, H2O is the first step toward apps on the convergence layer.
Let’s get technical
The best way to understand how the components of H2O fit together is by following the flow of data.
[image id=’3349′ alignment=’left’]
Let’s say we have some data. First, we need to make it a part of the decentralised web. To get the dataset on IPFS, we use H2O-Host. A lightweight application, H2O-Host creates an OrbitDB database from JSON-formatted data and sets us up as a peer (or node) on IPFS. Our database is referenced by an IPFS multihash, and anyone with this address can get a copy through OrbitDB.
The H2O-Host component includes a data generation script for testing, as well as an example dataset of Uber pickup locations.
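To give a feel for what clusterable test data can look like, here is a minimal sketch of a generator. It is not the script shipped with H2O-Host: the function name, the `x`/`y` field names and the JSON layout are all assumptions made for illustration.

```python
import json
import random

def generate_clustered_points(centres, points_per_cluster=50, spread=0.5, seed=0):
    """Return a list of {"x": ..., "y": ...} dicts scattered around each centre.

    Hypothetical helper: the field names are illustrative, not H2O-Host's format.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    points = []
    for cx, cy in centres:
        for _ in range(points_per_cluster):
            points.append({
                "x": rng.gauss(cx, spread),
                "y": rng.gauss(cy, spread),
            })
    rng.shuffle(points)  # drop the ordering that would give the clusters away
    return points

# Three made-up cluster centres in an arbitrary 2D plane.
dataset = generate_clustered_points([(0, 0), (5, 5), (0, 5)])
payload = json.dumps(dataset)  # JSON-formatted text, ready to be written to a file
```

Because the points are drawn around known centres, a dataset like this makes it easy to check by eye whether a clustering run has recovered the groups.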
Once a few people have a copy of our database, we get to see one of the standout features of IPFS and OrbitDB. When someone goes to our multihash, the closest peer to them is intelligently selected to serve them the data. This means as the network grows, the speed of serving content collectively improves. Better still, the data isn’t really stored anywhere in particular, but bounces around as it is needed. This massively improves censorship resistance and the ability to withstand denial of service attacks.
Ok, we’ve made the dataset available. Where does H2O come in?
We want a nice clustered dataset for AI that’s available to everyone. H2O is going to process it for us, but first it needs a copy from IPFS. Fortunately, OrbitDB makes this easy for us: we enter the address of our desired database in the H2O UI and the backend fetches the database by calling Orbit’s replicate function.
Let’s say our crypto-economics team in Toronto wants to send some unstructured, clusterable data for our London tech team to process for AI. With H2O-Host running in Toronto, replicating the data to H2O in London will take us around 10 seconds. This is unprecedented: IPFS is already demonstrating its potential as a viable alternative to HTTP, a protocol that’s been in development since 1989.
Now we have a copy of our dataset in H2O. Let’s cluster it.
Machine Learning in H2O
At this point, we probably want to take a look at our data. Fortunately, H2O visualises it for us by plotting it as soon as it has a copy. We can take a look and tell the algorithm how many clusters it should aim for.
The clustering technique H2O implements is K-means. K-means does what it says on the tin: it finds k means, or cluster centres. To do this, the algorithm repeatedly assigns each datapoint to its nearest centre and then moves each centre to the mean of the points assigned to it, stopping once the assignments no longer change. In other words, K-means looks for centres that minimise the total squared distance between each point and its centre. This approach is a form of vector quantisation, and it leaves every datapoint attached to exactly one cluster centre, creating a grouping.
[image id=’3377′ alignment=’center’]
K-means visualization by Andrey A. Shabalin
Finding the optimal K-means clustering is computationally hard in general, so in practice it is approximated with fast iterative heuristics. H2O computes the problem in Python using the open-source machine learning library scikit-learn. scikit-learn puts a vast array of machine learning algorithms at our fingertips and is currently one of the most widely used tools for developing AI.
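To make the idea concrete, here is a minimal plain-Python sketch of the classic iterative K-means heuristic (often called Lloyd's algorithm). It is not H2O's code, which relies on scikit-learn; the function name and parameters are chosen for illustration only, and a library implementation adds refinements such as smarter initialisation.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster 2D points with Lloyd's algorithm; return (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: move each centre to the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centres.append(centres[c])  # leave an empty cluster's centre alone
        if new_centres == centres:  # converged: no centre moved
            break
        centres = new_centres
    return centres, labels

# Two loose groups of made-up points.
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (9.0, 9.5), (10.0, 10.0), (9.5, 9.0)]
centres, labels = kmeans(points, k=2)
```

The result depends on the random starting centres, which is why the sketch takes a seed. The heuristic always settles on some grouping, but it is not guaranteed to be the best possible one.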
While Python drives the back-end of H2O (along with some NodeJS), the front end is written in Angular and interfaces with the back-end using Flask. This allows H2O to quickly relay and visualise relevant information, such as rendering the clustered output following a K-means computation.
To relate this to real data, we can use the Uber pickup location data for New York included in the H2O-Host repo. Let’s say we want to start a decentralised ride sharing app. If we own 5 cars, we take a sample of the pickup data, load it in H2O and specify 5 as the number of clusters. Plotting the resulting dataset on a map of the city, we can see the car deployment locations which minimise customer wait time and fuel costs.
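Once the cluster centres are known, matching a new pickup request to the nearest deployed car is a simple lookup. The sketch below assumes the centres are (latitude, longitude) pairs and uses straight-line (great-circle) distance as a rough proxy for wait time. The coordinates are made up for illustration and are not taken from the Uber dataset.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

def nearest_car(pickup, car_locations):
    """Return (index, distance_km) of the deployment location closest to a pickup."""
    distances = [haversine_km(pickup, car) for car in car_locations]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]

# Hypothetical cluster centres standing in for five deployed cars (lat, lon).
car_locations = [
    (40.7580, -73.9855),
    (40.7128, -74.0060),
    (40.7831, -73.9712),
    (40.6782, -73.9442),
    (40.7282, -73.7949),
]
index, distance = nearest_car((40.7484, -73.9857), car_locations)
```

A real deployment would need road distances and traffic rather than straight lines, but the lookup shows why well-placed centres keep the typical pickup distance short.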
[image id=’3350′ alignment=’center’]
The processed data, now monetizable on Ocean Protocol’s global data market, is ready to be published.
Serving the data
H2O integrates with the Kovan testnet, and developers who opt to use it are fed an Etherscan link to their published smart contracts when registering assets.
Note that Ocean Protocol, as a global data marketplace, does not host published datasets. The network demands that users host their own data and currently only supports Azure storage, though several alternatives, including decentralised options, are on the roadmap. H2O interfaces directly with Azure storage using the Azure Python SDK. In H2O, we’re able to directly upload our dataset in a minimised JSON format using only an account name and access key. The dataset is written straight to an Azure blob and a machine-readable download link is passed to the Ocean blockchain. This link is also fed back to us, the user, so we can download our clustered data immediately should we want to.
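As a small illustration of the minimised JSON step, the sketch below serialises clustered points without optional whitespace using only Python's standard library. The record layout (`x`, `y`, `cluster`) and the function name are assumptions, not H2O's actual schema, and the Azure upload itself is left out.

```python
import json

def to_minified_json(points, labels):
    """Serialise clustered points as compact JSON with no optional whitespace."""
    records = [
        {"x": x, "y": y, "cluster": label}  # hypothetical record layout
        for (x, y), label in zip(points, labels)
    ]
    # separators=(",", ":") removes the spaces json.dumps inserts by default.
    return json.dumps(records, separators=(",", ":"))

points = [(0.0, 0.0), (0.5, 1.0), (10.0, 10.0)]
labels = [0, 0, 1]
compact = to_minified_json(points, labels)
```

The compact form decodes to exactly the same data as the default output, so nothing is lost by stripping the whitespace before upload.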
In addition to Azure hosting, H2O includes proof-of-concept OrbitDB hosting. After all, OrbitDB can serve a user data in very much the same way that Azure can – it just isn’t supported by Ocean Protocol yet. Guides for getting started, whether with Azure or OrbitDB, can be found in the docs. Even if you don’t have access to any data yourself, H2O-Host comes with scripts for generating clusterable data that you can feed into H2O, so test away – check out the GitHub and live version.
For the full H2O presentation at Ocean Protocol’s Trilobite announcement see below: