Martin Saps

Data Science in Blockchain

May 20, 2019


Blockchains are open, high-availability ledgers whose maintainers have an economic incentive to keep them up to date. They are highly reliable and designed to be as compact as possible, which makes sourcing data easy.

This article discusses a practical approach to data science for the Web 3.0.

Why data science for blockchain?

The key difference between data science and traditional data analysis is the ability to gain insight from data without relying on manual, human-driven analysis. Unifying the fields of statistics and machine learning, data science can handle big data, ambiguity and even unstructured data sources. With roots in computer science, the field automates much of traditional business intelligence, deriving more useful and more accurate insights with less capital expenditure and in less time.

Blockchain, now a mainstay of open-source innovation, solves one of the core problems data scientists face: sourcing data. Breaking away from the traditionally secretive fintech world, blockchain incentivises data sharing and is designed with replicability in mind. In most cases, blockchains store timestamped financial data alongside a host of variables which might correlate with and influence it. Prediction, forecasting and analytics therefore have a concrete base to build on.

Where do you get the data?

Most well-known blockchains are public, permissionless ledgers maintained by nodes. A node is a computer that holds a full copy of the blockchain data, including past transactions, and stays up to date with any changes to the database.

To get data, you can either run your own node, pull from an existing node, or make use of hosted software which allows you to plug data queries directly into a network of your choosing.

The most direct approach, running your own node, involves starting a blockchain client, such as Parity for Ethereum, and running an ETL job, such as ethereum-etl, against it. A less CPU-intensive approach is to point the ETL job at an existing node through a hosted node service like Infura.
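To make the idea of pulling from a node concrete, here is a minimal sketch of reading a single block over Ethereum's JSON-RPC interface using only the Python standard library. The function names are hypothetical, and the node URL in the usage note is an assumption (a local client exposing JSON-RPC on port 8545), not something a particular setup guarantees. A real ETL job such as ethereum-etl does far more than this.

```python
import json
import urllib.request


def build_block_request(block="latest", full_transactions=False, request_id=1):
    """Build the JSON-RPC payload for eth_getBlockByNumber."""
    # Block numbers are sent as hex strings; tags such as "latest" pass through.
    block_param = hex(block) if isinstance(block, int) else block
    return {
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        "params": [block_param, full_transactions],
        "id": request_id,
    }


def decode_block(raw_block):
    """Convert the hex-encoded quantities of a raw block into Python ints."""
    return {
        "number": int(raw_block["number"], 16),
        "timestamp": int(raw_block["timestamp"], 16),
        "gas_used": int(raw_block["gasUsed"], 16),
        "tx_count": len(raw_block["transactions"]),
    }


def fetch_block(node_url, block="latest"):
    """POST the request to a node (URL supplied by the caller) and decode the reply."""
    payload = json.dumps(build_block_request(block)).encode("utf-8")
    request = urllib.request.Request(
        node_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return decode_block(body["result"])
```

Assuming a node is reachable at that address, fetch_block("http://localhost:8545") would return the latest block as a small dictionary. Pointing the same function at a hosted endpoint only changes the URL.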

Hosted solutions, such as Google BigQuery, also offer the ability to run data queries directly on existing crypto-network data without the need for additional software.
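As a rough illustration of the hosted route, the sketch below queries daily Ethereum activity through BigQuery. It assumes the third-party google-cloud-bigquery package is installed, that Google Cloud credentials are configured, and that the public dataset is available under the table and column names shown. None of those details come from this article, so check them before relying on the query.

```python
# Assumption: table and column names of the public Ethereum dataset.
DAILY_ACTIVITY_SQL = """
SELECT
  DATE(block_timestamp) AS day,
  COUNT(*) AS tx_count,
  SUM(value) / 1e18 AS volume_eth
FROM `bigquery-public-data.crypto_ethereum.transactions`
WHERE DATE(block_timestamp) BETWEEN @start_date AND @end_date
GROUP BY day
ORDER BY day
"""


def run_daily_activity(start_date, end_date):
    """Run the query for a date range (datetime.date values) and return rows as dicts."""
    # Imported here so this module loads even without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default Google Cloud credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start_date),
            bigquery.ScalarQueryParameter("end_date", "DATE", end_date),
        ]
    )
    rows = client.query(DAILY_ACTIVITY_SQL, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Query parameters are used instead of string formatting so that the date range never has to be spliced into the SQL by hand.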

The leading blockchain data source designed for data scientists, however, is Ocean Protocol. The platform handles dataset curation for you, surfacing the datasets that are most useful to common machine learning algorithms.

Where Ocean Protocol differentiates itself from other chains is that it is not limited to the narrow financial scope of a ledger. Any data may be hosted on the platform, and big data providers have an economic incentive to do so. Further, one of its most distinctive aspects is that the platform enables you to compute on private information using an algorithm-to-data approach, which has far-reaching applications in medicine, finance and AI.

Which tools do you use?

Before starting, a good data scientist identifies their requirements and selects the most appropriate approach to the problem. There are a host of tools available for developing data solutions, whether working with neural networks, autoregressive models or traditional analytic methods. It is important to understand how to judge accuracy, test software, and present results before putting together a toolkit.
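One way to make "judging accuracy" concrete is to hold back the most recent observations and compare a model's forecasts against a naive baseline. The sketch below does this for a first-order autoregressive model fitted by least squares with numpy. The function names are hypothetical and the approach is one simple option among many, not a recommended evaluation protocol.

```python
import numpy as np


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x = np.asarray(series, dtype=float)
    # Design matrix: a column of ones for the intercept and the lagged values.
    design = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    (c, phi), *_ = np.linalg.lstsq(design, x[1:], rcond=None)
    return c, phi


def holdout_mae(series, n_test):
    """One-step-ahead mean absolute error on the last n_test points.

    Returns (ar_mae, naive_mae), where the naive forecast simply repeats
    the previous observation.
    """
    x = np.asarray(series, dtype=float)
    c, phi = fit_ar1(x[:-n_test])      # fit only on the training part
    previous = x[-n_test - 1:-1]       # value observed just before each test point
    actual = x[-n_test:]
    ar_mae = np.mean(np.abs(actual - (c + phi * previous)))
    naive_mae = np.mean(np.abs(actual - previous))
    return ar_mae, naive_mae
```

If the model cannot beat the naive baseline on data it has not seen, it is not yet adding anything.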

A key language in any data scientist’s arsenal is Python. Python’s ecosystem of libraries makes working with data easy, whether for statistics and visualisation with scipy and seaborn, data manipulation with numpy and pandas, or machine learning with scikit-learn (sklearn). R, the former leader in this space, offers an equally mature ecosystem and is more tailored to statistical analysis.
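As a small taste of that ecosystem, the sketch below uses pandas to turn raw transaction rows into a daily summary. The rows are made-up illustrative values and the column names are assumptions. In practice the frame would be loaded from a node, an ETL export or a hosted query.

```python
import pandas as pd

# Illustrative, made-up transactions (Unix timestamps, values in wei).
transactions = pd.DataFrame({
    "timestamp": [1558310400, 1558314000, 1558396800, 1558400400, 1558483200],
    "value_wei": [2 * 10**18, 5 * 10**17, 10**18, 3 * 10**18, 15 * 10**17],
    "gas_used": [21000, 21000, 50000, 21000, 90000],
})


def daily_summary(frame):
    """Aggregate per-transaction rows into one row per calendar day (UTC)."""
    frame = frame.copy()
    frame["day"] = pd.to_datetime(frame["timestamp"], unit="s").dt.floor("D")
    frame["value_eth"] = frame["value_wei"] / 1e18  # 1 ether = 10^18 wei
    return frame.groupby("day").agg(
        tx_count=("value_eth", "count"),
        volume_eth=("value_eth", "sum"),
        mean_gas=("gas_used", "mean"),
    )


daily = daily_summary(transactions)
```

The resulting frame is indexed by day, which makes it straightforward to join against price data or feed into a forecasting model.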

When it comes to performance-critical components like high-frequency execution engines, C++ is often used to drive backend work. Rust is also gaining a lot of ground in this space, offering more modern syntax and better code safety without sacrificing performance versus C++.

Once you’re able to deliver a product, look to Ocean Protocol to monetise your work. Not only can you sell your outputs, such as filtered data and predictions, but you can also offer your algorithm as a paid service, or even sell it outright.

What is Outlier Ventures doing in data science?

Outlier Ventures sits at the crossroads of technology and finance. As a value investor in one of the most innovative technology sectors, it’s important that we help to develop the industry we invest in. Recently, we’ve been working on a joint project between our internal technology and crypto-economics teams to create a predictive analytics engine for the wider industry, and we can’t wait to share it.

We maintain our commitment to open-source, so if you’re a data scientist looking at blockchain, keep an eye on our GitHub. Our latest addition to the Convergence Stack is almost ready.
