Abstract
It is widely recognized that data can be viewed as an asset. However, accounting for data assets and pricing data transactions remain difficult because reasonable measurements of datasets and data products are lacking. The literature on data pricing mainly focuses on traditional pricing models, including models based on the contents of data, market demand, data quality, etc. However, due to the particularity of data, these models may not be consistent with measure theory and thus suffer from several problems. For example, they do not consider how to price datasets that share common contents, whether we should pay for a repeat purchase, or how to formally define a peak-valley tariff for usage-based pricing. To tackle these problems, in this paper we formally define measure spaces for datasets and data products. Specifically, we introduce measures on discrete, continuous and product data spaces, respectively. Further, we introduce the integral and propose a measure-based pricing framework for data products. Our work is parallel to existing pricing models. We focus on how to measure data, and pricing data is a natural extension obtained by integrating the unit price function with respect to the measure. In contrast, existing models focus on determining total prices directly by considering many factors such as the contents of data and market demand. By analyzing several real-world applications and cases, we demonstrate the effectiveness and generality of our proposal.
Introduction
In the era of big data, data are widely viewed as a vital element, production material and fundamental resource, and thus have drawn great attention from researchers. The importance of viewing data as an asset has been widely recognized. However, it is still difficult to determine the prices of data assets and data transactions.
We note that pricing songs, electronic books, etc., is relatively easy, and the resulting prices are accepted by most people. It is easy because these products have counterparts in the real world. For instance, a physical CD may cost $10, while iTunes charges $0.99 per song online. We price songs in the same way that we price CDs in the real world, i.e., piece by piece, although their unit prices may depend on different aspects. But data that have physical equivalents come mainly from digitalization processes, and they account for only a small portion of the data in the world. For “native-born” data in cyberspace, how to price them is still controversial, since we cannot find their physical equivalents and thus it is unclear how to “enumerate” them like songs or books.
The absence of reasonable measurements of data is the main obstacle to pricing. Since a dataset (or a data asset) means different things to different buyers, it can result in price discrimination and opaque markets [24]. Sellers and buyers can hardly agree on prices without a universal and widely accepted way to determine the size of data. For example, a dataset of detailed climate records containing sensor locations, temperature, humidity, rainfall, etc., seems less valuable to a researcher who merely studies the relation between temperature and rainfall than to one who studies weather forecasting. The former may mistakenly regard this dataset as “small” since he only needs the temperature and rainfall data. But the essence of the dataset, i.e., its content, is constant. What really differs is each buyer's psychological price, which reflects his needs. It is necessary to separate measuring from pricing so that we can eliminate these misunderstandings and conflicts by setting user-independent measurements of data. Considering that cyberspace is overwhelmed by native-born data, e.g., system logs and sensor records, and that data trading and distribution inevitably involve these data, it is crucial to find a way to measure them. In this paper, we study data measuring with the help of measure theory and further study data pricing by introducing integrals. We explicitly measure datasets and data assets with mathematical tools, so that their sizes can be determined indisputably, and the prices of datasets and data assets can then be computed given unit prices. By separating measuring from pricing, more widely accepted prices can be set and more transparent markets can be established. Meanwhile, we retain the flexibility to adjust prices by determining unit prices.
Although there are many data pricing models, which we will discuss in
Section 2, we find that most of them treat a
dataset as a whole and do not satisfy the requirements of measures. For
example, how much should we pay for two datasets that share a common part? This is
a usual situation when we would like to buy two digital collections, e.g., music
albums, with common items, e.g., songs. Moreover, some data suppliers like tradestation.com promise that they will keep datasets updated. If
some users have different versions of the same dataset, how should the supplier
determine personalized prices of an update? Formally speaking, assume that
A and B are two datasets with

AWS Spot Instance. Source:
On the other hand, for usage-based data products, the unit price may vary like a peak-valley tariff, which is complicated to model. For instance, in AWS Spot Instance, the price is determined by the bidding prices of different users, as shown in Fig. 1. Moreover, a user's instance is stopped whenever his bidding price is lower than the market price at that time. He needs to raise his bidding price to regain the right to use the instance. From this example we can see that an effective and flexible framework is needed for pricing.
These problems are due to the special properties of data products, e.g., near-zero marginal costs. Most existing pricing models do not meet the requirements of measures, such as countable additivity, and thus have flaws. So, before building data pricing models, we need to study measures for datasets and data products. Only once measures for data are established can we study pricing and accounting problems, which in turn facilitates data distribution and data trading.
To tackle the above problems, we formally define the domains, measurable spaces and measures for data products. We consider data products that are associated with discrete or continuous variables, or discrete/continuous data products for short. We propose measure spaces that are suitable for datasets and data products, on which we further introduce integrals to provide a flexible way to price data.
We need to highlight that our work is parallel to existing pricing models. Most existing models focus on how to determine prices by considering many factors such as the contents of data, market demand and data quality, but neglect the basic requirements of measure theory. In this work, we study a more fundamental problem: how to measure a dataset or a data asset. Once we have figured out how to measure data and have introduced measure spaces, we can naturally price data with integrals. Moreover, our framework can incorporate previous pricing models seamlessly by using classic pricing models to determine the unit prices, which act as part of the integrands. Thus the proposed framework decouples data measuring from unit pricing, and is flexible enough to adapt to various domains.
The contribution of this paper is two-fold:
We propose a pricing framework based on measure theory. We present the measure spaces for discrete and continuous data products. Some common measures are also illustrated.
We study a range of examples and cases from real-world applications, which can be formulated under our proposed framework. This demonstrates the effectiveness and generality of our proposal.
The organization of this paper is as follows. In Section 2, we review and analyze the related work. In Section 3, we introduce the basic concepts and notations of measure theory, and then define the measure space for discrete data products, i.e., those that can be organized as discrete and finite sets. We also analyze several real-world applications and show how to formulate them under our proposal. In Section 4, we define the measure space for continuous data products, i.e., those that are associated with a continuous variable like usage or duration, and show some examples. In Section 5, to cover more complicated applications, we study product measures based on the previous two sections. Examples illustrate the flexibility and effectiveness of our pricing framework. In Section 6, we review several data trading platforms and show how our proposal can facilitate them. Finally, we conclude this work in Section 7.
Currently, data markets are still in their infancy. The economic principles guiding the pricing of data, data products and data services remain largely unexplored [5]. Although a standardized pricing model would facilitate transparent transactions and improve efficiency in the data market [12], existing data-pricing mechanisms lack a data measuring model [27].
The literature on data pricing mainly builds upon traditional pricing models. For
example, Moody and Walsh [17] introduced a
number of laws to define information as an asset and modified the historical cost
method for valuing information. We briefly classify the literature into the following
aspects.

Some scholars [18,23,27]
summarized existing online and offline pricing models of data products
in the data market. (i) In free models, data or data services
can be used for free; for example, the data in some public repositories can be
obtained at no cost. (ii) In freemium models, consumers can obtain or use
limited data products for free, and pay for value-added services. (iii)
In pay-per-use models, fees are charged based on users' usage
counts; this model has been applied to some API calls. (iv) In packaging
models, consumers pay a fixed fee for a certain amount of data or
services. (v) In flat-fee models, data consumers pay a
pre-determined fee for unlimited use of data or services in a certain
period of time. (vi) In two-part-tariff models, consumers pay a fixed
price for the basic services, while extra payment is required for
usage beyond the pre-defined quota.

Fruhwirth et al. [9] reviewed various data marketplaces and their
business models, and summarized their pricing models and price discovery
mechanisms. They found that the pricing models used by existing data
marketplaces include usage-based pricing, package pricing, flat-fee
tariffs, and freemium, while price discovery mechanisms include fixed
prices, seller-set prices, and prices decided through auction or
negotiation.

Liang et al. [15] classified different data market structures, data
pricing strategies and data pricing models. The data pricing models
fall into two general branches: economic-based pricing models and
game theory-based pricing models. The economic-based pricing models
comprise the cost model, consumer perceived
model, supply and demand model, differential pricing model and dynamic
data pricing (smart data pricing) model, while the game theory-based pricing models include
the non-cooperative game, Stackelberg game and bargaining game,
etc.
Measures on discrete data spaces
To facilitate data pricing and data distribution, we need to find a concise way to describe the sizes of datasets and data assets. Inspired by counting items in the real world, we find that in cyberspace some data can also be counted piece by piece.
In many applications, data consist of several discrete data points. For example, an image dataset like ImageNet1
Many data products can be viewed as finite sets of discrete data points. For
example: Digital music: In 2003, all
music available on iTunes Store2
PPC (Pay Per Click): Keyword search engines such as Google Adsense3
Subscription services: Many service providers now offer subscription services. Spotify,5
Let us review some definitions and notations of the measure theory [11]. Given a set X, we
denote
(Countable Additivity) for any disjoint
The triple
Now, we introduce the domain we use in this section. We denote
Given a function
Examples
In this subsection, we demonstrate the effectiveness of our pricing framework by
analyzing several real-world cases, and answer the first two questions raised in the
Introduction. Given a
set of records Let X denote the set of
files, where Let X denote the set of all ads in a PPC
advertisement system. Clearly, an ad is charged by clicks. Unlike
Example 3.1, clicking an ad twice is
different from clicking once. So, we solve this problem by using the
integral. We use In the scenario of subscription services,
we denote X as the set of all tiers of subscription and
We denote X as the set of all songs in
iTunes and use the counting measure as μ. If we use the
constant function
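To make the discrete case concrete, the following is a minimal Python sketch, assuming the counting measure and a constant unit price per item. The album contents, the names `counting_measure`, `price` and `flat_unit_price`, and the $0.99 figure used as a flat rate are illustrative assumptions, not the paper's notation. Under the counting measure, the integral of the unit price reduces to a sum over distinct items, so a shared item is charged only once.

```python
from typing import Callable, Hashable, Iterable, Set


def counting_measure(dataset: Set[Hashable]) -> int:
    """Counting measure: the size of a finite dataset is its number of elements."""
    return len(dataset)


def price(dataset: Iterable[Hashable], unit_price: Callable[[Hashable], float]) -> float:
    """Integral of the unit price under the counting measure, i.e. a sum over distinct items."""
    return sum(unit_price(item) for item in set(dataset))


def flat_unit_price(item: Hashable) -> float:
    """Assumed constant unit price per item (hypothetical flat rate)."""
    return 0.99


# Two hypothetical albums that share one song.
album_a = {"song1", "song2", "song3"}
album_b = {"song3", "song4"}

# Price of the union: each distinct song is charged once.
bundle = price(album_a | album_b, flat_unit_price)

# Pricing the albums separately double-charges the shared song;
# subtracting the price of the intersection recovers the bundle price.
separate = price(album_a, flat_unit_price) + price(album_b, flat_unit_price)
overlap = price(album_a & album_b, flat_unit_price)

# Repeat purchase: a buyer who already owns album_a pays only for the new songs.
top_up = price(album_b - album_a, flat_unit_price)
```

In this sketch `bundle` equals `separate - overlap`, which is the additivity property that a measure-based price should satisfy, and `top_up` covers only the single song not already owned.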
Besides finite discrete sets, many datasets and data products are charged according to a continuous variable like usage or duration, which can be represented by continuous and infinite sets. Since counting data points piece by piece can hardly be applied here, we resort to the Lebesgue measure to define the sizes of data. So in this section, we study measures on continuous data spaces with the help of the Lebesgue measure and the integral on data.
Background
Many data products are priced with respect to the usage or duration. Massively multiplayer online (MMO) games:
MMO games such as World of Warcraft7
Servers (Cloud computing): AWS (Amazon web service)9
Online streaming platform: Datacast streaming service12
Cloud service: Alibaba Cloud IOT platform13
Amazon EC2 Spot Instances:14
In this subsection, we describe the pricing framework for data products that are
associated with a continuous variable. Considering that the usage or duration is
generally non-negative, we use the half line
Examples
In applications that are charged by duration, we could use a constant
function
In Alibaba Cloud IOT platform, we use a piece-wise function
In the market of AWS EC2 Spot instances, we denote the market price at time
t as
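As an illustration of usage-based pricing with a varying unit price, the sketch below integrates a piecewise-constant unit price over a usage interval. The tariff segments, the rates, and the names `TARIFF` and `price_interval` are hypothetical and chosen only for illustration; the measure of each overlap is its length, i.e., the Lebesgue measure of an interval.

```python
from typing import List, Tuple

# Hypothetical peak-valley tariff over one day:
# (start_hour, end_hour, price_per_hour), piecewise constant.
TARIFF: List[Tuple[float, float, float]] = [
    (0.0, 8.0, 0.5),    # valley
    (8.0, 20.0, 1.0),   # peak
    (20.0, 24.0, 0.5),  # valley
]


def price_interval(start: float, end: float,
                   tariff: List[Tuple[float, float, float]] = TARIFF) -> float:
    """Integrate a piecewise-constant unit price over the usage interval [start, end]."""
    total = 0.0
    for seg_start, seg_end, rate in tariff:
        # Length of the overlap between the usage interval and this tariff segment.
        overlap = max(0.0, min(end, seg_end) - max(start, seg_start))
        total += rate * overlap
    return total


# Usage from 06:00 to 10:00 spans two valley hours and two peak hours.
cost = price_interval(6.0, 10.0)
```

Splitting the usage at 08:00 and pricing the two parts separately gives the same total, which reflects the additivity of the integral over disjoint intervals.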
In previous sections, we have discussed measures for discrete and continuous data spaces. However, in many applications, data products may have complicated pricing strategies that involve multiple variables. In this section, we utilize product measurable spaces and product measures to cover these complicated cases.

The Lebesgue integral.
There are many applications that involve multiple discrete or continuous
variables for pricing. Membership
subscription: Amazon Prime membership subscription service15
Patreon: Patreon16
Amazon EC2 Spot instances:17
In this part, we talk about product measures. Without loss of generality, we only consider the product of two spaces.
First, we illustrate the product of two spaces with discrete sets. Given two
measure spaces
For the product of two spaces with continuous sets, i.e.,
Similarly, given
Examples
In Patreon, patrons could have different bonus services according to their
payment tiers. Here, we denote
The unit price function h of Example 5.1
Now, the integral
We extend the Example 4.3. We denote
X as the set of instances with different
configurations. Since X is discrete, we could formulate the
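A product-space price can be sketched as an iterated computation: a sum over the discrete factor (instance configurations) of an integral over the continuous factor (time). The sketch below assumes, purely for illustration, that each configuration has a unit price that is constant in time; the configuration names, the rates, and the function `product_price` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical hourly rates per instance configuration (the discrete factor X).
HOURLY_RATE: Dict[str, float] = {"small": 0.25, "large": 1.0}


def product_price(usage: Dict[str, Tuple[float, float]],
                  rates: Dict[str, float] = HOURLY_RATE) -> float:
    """Sum over configurations of (rate x length of that configuration's usage interval)."""
    total = 0.0
    for instance, (start, end) in usage.items():
        # Inner integral over time: constant rate times the interval length.
        total += rates[instance] * max(0.0, end - start)
    return total


# A user runs a small instance for hours [0, 4] and a large one for hours [2, 5].
bill = product_price({"small": (0.0, 4.0), "large": (2.0, 5.0)})
```

A time-varying rate per configuration would replace the product `rate * length` with an integral of that rate over the usage interval.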
In this section, we briefly review some data trading platforms and see how our proposal can facilitate their showcases as well as trading processes.
youedata.com and juhe.cn are online dataset/API trading
platforms. Each dataset is marked with a total price. As discussed before,
mixing measuring and pricing together complicates the process of
pricing and results in counterintuitive prices that do not meet the requirements of
measure theory. Moreover, a consumer may find it hard to figure out how
many records a dataset contains and what the unit price is. For example, on the index
page, a cheaper dataset with few records attracts more consumers, but it may have a
higher unit price than competing products. This harms the fairness of the
platform. With our framework, data products can be examined at a finer granularity.
We set the price as the integral
Some data suppliers on platforms like tradestation.com, datatang.com and
datasl.com may promise that they will keep datasets updated. But how
should updates be charged? This is a critical question for both consumers and suppliers, but
it has not been dealt with appropriately on existing platforms. As we mentioned
earlier, care should be taken when data may overlap. If
consumers bought different versions of the same dataset, so that the data content held by
each consumer differs slightly from that of the others, the data supplier will struggle to
determine personalized prices of the update for different consumers. With the
help of the measure theory adopted in our proposal, this problem can be resolved.
Since we introduce the measure μ, the size of the updated part can be
easily computed by
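One possible reading of the update scenario can be sketched in Python, under the assumption (ours, for illustration) that the chargeable part of an update is the set of records the consumer does not already hold, measured by the counting measure and priced at a flat unit price. The record identifiers, the unit price, and the function `update_price` are hypothetical.

```python
from typing import Hashable, Set


def update_price(new_version: Set[Hashable], owned: Set[Hashable],
                 unit_price: float) -> float:
    """Charge only for records in the new version that the consumer does not hold."""
    delta = new_version - owned      # records missing from the consumer's copy
    return unit_price * len(delta)   # counting measure times a flat unit price


# Hypothetical latest version and two consumers holding different older versions.
latest = {1, 2, 3, 4, 5, 6}
consumer_a = {1, 2, 3, 4}
consumer_b = {1, 2}

price_a = update_price(latest, consumer_a, unit_price=0.5)
price_b = update_price(latest, consumer_b, unit_price=0.5)
```

Each consumer's price depends only on the measure of the part they are missing, so personalized update prices follow directly from the versions they hold.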
Another popular option for data trading is to negotiate prices between consumers and
suppliers, where platforms like factual.com, datasl.com and
finndy.com act as intermediaries and only charge management fees.
Under our framework, only the unit price f needs to be determined
through negotiation, at which point market demand, data quality and other factors
may come into play. Then the total price along with management fees can be
computed by the integrals
Discussion From the examples introduced in previous sections and the aforementioned platforms, we can see that the framework proposed in this paper can be adopted across various domains. The core idea is to separate the measuring process from pricing data. Measures are also independent of consumers and suppliers, so that everyone can agree on the “volume” of a dataset. Once we establish measures for data, we can “enumerate” data just like physical things, and experience in pricing physical things will help us price data by determining unit prices. Our work is parallel to existing pricing models and can also benefit from them. The proposed measuring and pricing framework in turn makes it possible to solve accounting problems and facilitate data distribution and data trading.
Limitations Due to the high level of abstraction, there are still some gaps to fill before applying our framework. The granularity of data points, or equivalently, the elements of the dataset D, needs to be determined. For example, the suitable unit for the ImageNet dataset is images, rather than pixels. Moreover, we need to define the measure μ and the unit pricing function f whenever we encounter a new scenario, where exploration of data properties, market needs, etc., is still inevitable.
Conclusion
In this paper, we review multiple datasets and data products, and formulate the
measure spaces for discrete and continuous data. By introducing measure spaces, we
could measure the volume of
Future work may involve the following aspects:
In addition to the counting measure and the Lebesgue measure used in this paper, we could consider more measures to cover more applications.
We may study pricing models of datasets or data products in specific fields. Based on the existing pricing framework, it is possible to study the pricing models of data assets in certain domains.
Putting the pricing of data products into practice while taking existing economic principles into consideration is also a potential direction. Accounting is a process of measurement. Now that many pricing frameworks concerning data products have been summarized, it is possible to conduct pricing studies on these data products. Standardizing and regulating the pricing of data products benefits their trading and distribution in the data market, as well as research on the pricing of other data assets.
