In Data We Trust

Title

In Data We Trust

Team member names

Gary Isaac Wolf
Thomas Blomseth Christiansen

Short Summary of Your Improvement Idea

We propose improving the routines for securing trust in empirical data with a protocol that can ensure provenance, integrity, and addressability for observational records at an arbitrary level of resolution, while keeping computational cost reasonable for most real-world use cases. We will use our PIG summer to build a working model of this idea and expose it for people to use and examine. The model will put into practice new protocol primitives that make it tractable to automate data trust routines, supporting verifiable provenance and detection of tampered data while establishing guarantees about the temporal boundaries around when each empirical observation makes the transition from the physical to the digital realm.

What is the existing target protocol you are hoping to improve or enhance?

Routines for securing trust in empirical data.

What is the core idea or insight about potential improvement you want to pursue?

The growth of empirical research, and the broader use of real-world sensor data, renders handcrafted trust routines obsolete. Applying a relatively simple protocol at the atoms-bits border permits automation of data provenance and forensics routines downstream, with the potential to dramatically upgrade social reasoning processes.

What is your discovery methodology for investigating the current state of the target protocol?

The lab is our field. We work with researchers building instruments themselves for data acquisition, and we have experiential knowledge of common data trust failures (incomplete metadata, data fumbling, rogue behavior) that we will simulate in our model. Additionally, we might critically review some monstrous artifacts of standards-based approaches whose nightmarish absorption of futile attention provides some of our emotional fuel.

In what form will you prototype your improvement idea?

We’re going for running code that people can inspect and play with. This includes a model simulator and a functioning breadboard crypto instrument talking to a one-node Cosmos appchain.

How will you field-test your improvement idea?

We’ll test the idea by building a model world in which there are agents that generate data and agents that cause trouble by fumbling the data in recognizable ways (bad, bad agents), along with user-operated controls supporting forensics, so that people can test our method for securing fine-grained provenance. The model will inform discussions with our collaborators about how to implement the protocol in their instrumentation and benefit from it in downstream processes in their labs.
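To make this concrete, here is a minimal sketch of the kind of toy world we have in mind, in plain Python with only the standard library. All names are hypothetical, and an HMAC with a per-instrument secret stands in for whatever device signature scheme the real protocol would use: honest agents emit hash-chained, authenticated records, a bad agent edits one after the fact, and a forensic pass localizes the damage.

```python
# Toy model of the field-test world described above. Standard library only;
# HMAC with a shared per-instrument secret stands in for real device signatures.
import hashlib, hmac, json

def record(instrument_key: bytes, prev_hash: str, payload: dict) -> dict:
    """Honest agent: emit a hash-chained, authenticated observation."""
    body = {"prev": prev_hash, "payload": payload}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    tag = hmac.new(instrument_key, digest.encode(), "sha256").hexdigest()
    return {**body, "hash": digest, "tag": tag}

def tamper(chain: list, index: int) -> None:
    """Bad agent: silently edit one observation after the fact."""
    chain[index]["payload"]["value"] += 1.0

def audit(chain: list, instrument_key: bytes) -> list:
    """Forensics: report the indices of records that no longer verify."""
    bad, prev = [], "genesis"
    for i, rec in enumerate(chain):
        body = {"prev": rec["prev"], "payload": rec["payload"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        expected = hmac.new(instrument_key, digest.encode(), "sha256").hexdigest()
        if not (rec["prev"] == prev and digest == rec["hash"]
                and hmac.compare_digest(rec["tag"], expected)):
            bad.append(i)
        prev = rec["hash"]
    return bad

key, prev, chain = b"device-secret", "genesis", []
for t in range(5):
    rec = record(key, prev, {"t": t, "value": 20.0 + t})
    chain.append(rec)
    prev = rec["hash"]

tamper(chain, 2)
print("records failing verification:", audit(chain, key))  # -> [2]
```

The point of the exercise is that the forensic step becomes a mechanical check rather than a reconstruction from memory and lab notebooks.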

Who will be able to judge the quality of your output?

Albert Wenger
Primavera De Filippi
Chris Dixon
Brian Nosek

How will you publish and evangelize your improvement idea?

  1. Longish essay about provenance trouble that could be fairly entertaining, because the examples of what happens when provenance fails have something of an America’s Funniest Home Videos quality to them. (You just knew something terrible was going to happen when you saw the skateboard and the slide.)

  2. Computational notebook providing access to our implementation.

What is the success vision for your idea?

A future where decisions about the trajectory of a scientific inquiry, the efficiency of a production process, or the direction of government policy are made in the light of verifiable, trustworthy empirical data.


Feedback question

Note: In the time since posting this question, I believe we’ve answered it for ourselves through reading others’ posts: it appears that our protocol definition is on the far end of specifically recognizable in the real world, so we think we’ve confirmed we’re in scope. We’re leaving this reply up for anybody who wants more background, and of course we’re open to correction, but we didn’t want to pretend we still had the same level of uncertainty as when we originally posted.


We wanted to obey the ~500 word requirement, but we have a specific feedback question that we wanted to ask, and perhaps it will provoke some ideas or questions.

What we call data trust routines take different forms in different domains. Provenance issues are common when dealing with sensor data, survey and polling data, clinical research data, and measurement of industrial processes. The domain-specific trust routines go by a variety of names, but they follow roughly the same steps. There is a planning routine involving adoption and use of standards and commonly accepted instruments; a data collection routine involving process checks and quality sampling; a documentation routine involving arrangement of and access to records; and a forensic routine involving inspection, troubleshooting, and assurances to third parties.

These routines operate on the border between real world mess and formal representation. In the real world headers get lost, people commit intentional fraud, and nobody remembers what version of what instrument was originally used. Maintaining the tie between the world and its representation is time-consuming and expensive. For instance, in clinical research there’s an activity called “source data verification” that has been measured as an industrial input, with estimates as high as 30% of research cost. While this aggregate figure is itself subject to the same source data worries it purports to quantify, every empirical researcher knows from experience that the cost of building trust in data is real and significant. Narrative examples of the problem include headlines like: “Stanford president to resign after investigation finds he failed to ‘decisively and forthrightly’ correct research.”

We realize that in our definition of the protocol we’d like to improve we’re not describing a single set of concrete operations. There is a lot of diversity here. But all these trust routines have common features that allow us to conceive of a universal upgrade across the entire set. While the routines across these domains may not be quite recognizable as a single protocol, we want to demonstrate our idea for a missing primitive that can protocolize them. Does this fly with you? Do you accept our protocol framework (planning, data collection, documentation, forensics) as a good definition of a protocol to be improved?

We’re asking for this feedback because we don’t necessarily have to go with this degree of generality on the problem side. Our idea for improvement originates in the domain of electromechanical measurements related to health, where we’ve been working for a long time and have experience building our own hardware and addressing concerns about data aggregation and traceability. If we stuck to this, the relevant current protocols would be much easier to describe concretely. BUT the protocol improvement we have in mind is more general and we’ve gotten pretty interested in thinking about it this way. So: is this starting point OK? Or do you think a more concrete, domain-centric starting point would be more in sync with this year’s design?

Regarding

I am reminded of Faust.


This link (thank you!) prompted an interesting short discussion this morning during CET/PST overlap hours. Roc proposes a new kind of hardware that speaks crypto from the ground up. We 100% support this idea. It must happen.

OK, so: can it happen?

And: What can we do to make it more likely to happen?

Ted Nelson described universal transclusion in 1980. Today, if you look at tutorials on Roam or Obsidian, you will see people marveling at (and struggling with) merely local transclusion. We’re forty years down the road! Why? Because, as anybody on this forum probably understands, full transclusion requires a world computer. Unless you can think of another way to do it? And as we know, the concept of a world computer didn’t even show up until 2013. Something important was missing. Nelson not only could not have invented it, he didn’t even know it was needed.

In our field, we hear all the time about “moonshots.” When I hear it, I try to force my face into a blank expression, because I typically like the person who is using it and want to be supportive and hear them out. But it’s painful, because it is used to mean: “if we know what we want to do, state it clearly, supply enough resources, and work with courage and faith, we can count on reaching our goal.” This is not true, and should not be the lesson anybody takes from Apollo. The important thing is to develop a sensitive awareness of where you are in the story; meaning, what are the real constraints that have to be overcome in order to realize what’s obvious conceptually. Do we really know where we are in the story?

A few years ago I organized a Quantified Self/Hyperledger MeetUp in Amsterdam and we got into a good conversation about protocols for empirical research, based on Thomas’ talk about time-stamps and time-series data. It was easy to get to the end of the story in our imaginations: secure, traceable real world data in blockspace.

But the intermediate steps: wow. Could you get even a rough sense of how many there were? Even today, it is very easy to say “zero-knowledge proof” but go ahead and try to prototype this so that it works for a network of very small sensors that may be turning on and off only briefly to make an intermittent data transfer connection, while otherwise preserving battery life. Oh, and also please be practically useful to grad students responsible for data collection on shoestring budgets. This is just one quick sketch, there are many other, less instantly comprehensible constraints.

The hand-wavy solution treats data as a blob. Notarize the blob as soon as it arrives at a convenient transit point and your work is done. There are a number of reasons why this isn’t good enough. No time to go through this in detail (hopefully we’ll have the summer to learn to be clearer), but, to do the problem in your head, just ask: What counts as an instrument? This question is harder than it looks, especially if you don’t give in to the temptation to reason in a circular fashion. (Circular: We’ll provide certain kinds of instruments with IDs and define instruments as only those kinds of things to which we can give IDs.) When you start in the actually existing world of real instruments, today’s instruments, and ask: how can we issue secure universal IDs to each instrument, however it is defined, and notarize each observation, at any temporal resolution, then you get to a really fun domain of questions.
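As a rough illustration of the alternative to blob notarization, here is a minimal sketch (Python standard library only, all identifiers hypothetical) in which an instrument gets a stable ID derived from its own identity material and folds every observation into a running hash chain, so that only the chain head ever needs to be anchored externally.

```python
# Per-observation commitments instead of notarizing a whole dataset as a blob.
# The "public key" below is a stand-in for whatever identity material a real
# instrument could carry; nothing here is a finished design.
import hashlib, os, time

def instrument_id(pubkey: bytes) -> str:
    """A universal instrument ID derived from the device's own identity material,
    rather than assigned by a registry that pre-decides what counts as an instrument."""
    return hashlib.sha256(b"instrument-id/v0" + pubkey).hexdigest()[:16]

def commit(prev_head: str, iid: str, t_ns: int, sample: float) -> str:
    """Fold one observation into the instrument's running hash chain."""
    leaf = f"{iid}|{t_ns}|{sample!r}".encode()
    return hashlib.sha256(prev_head.encode() + hashlib.sha256(leaf).digest()).hexdigest()

pubkey = os.urandom(32)            # stand-in for a device key pair
iid = instrument_id(pubkey)
head = "genesis"
for _ in range(1000):              # e.g. a 10 Hz sensor running for 100 seconds
    head = commit(head, iid, time.time_ns(), 20.0)

# Only `head` needs to be notarized externally. Re-deriving the same head from
# the stored observations later shows that none were altered, inserted, or
# dropped since the anchor was made, at whatever temporal resolution the device used.
print(iid, head)
```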

Of course this fun comes with a price: you start to realize there are a number of very different and very hard problems between where we are today and a full stack solution, many of which you cannot solve yourself, even if you’re in a good mood and willing to work hard. That means you depend crucially on what other people do, and you often don’t know for sure how well their part is working out.

One day I was holding my head in my hands after hearing the word “moonshot” one too many times in the same week and Thomas said: “What if we think we’re NASA in the sixties but we’re really more like Robert Goddard in the twenties?” Brutal, but I laughed. Our way of coping with this uncertainty about where we are in the story is to try to create a protocol for some fundamental services needed right now, but to create them in a way that will survive technical upheaval. The landmark example is TCP. With that simultaneously modest and megalomaniacal goal in mind we’re asking questions like: What is an instrument? What is an observational record? What is traceability? Maybe if we can get the primitives right, they’ll even get into Roc.


Yes, thank you for the pointer to Roc, @n_a. Interesting to compare and contrast their approach to how we are thinking about it. At this stage, they’re very light in detail in terms of what they’re really doing—it’s all very stealthy—but it seems like their architecture is full stack going all the way from hardware up through software.

Looking at their approach, I’d say they are working in a different region of the trade-off space than we are, which will make certain things easier and other things harder. Hardware for sure is hard, but in a certain sense, if you’re opting for having your own hardware all the way down to where it’s touching reality, you’re also making some things easier for yourself. And then the trickiness moves into other dimensions. I think adoption will be a challenge for Roc.

What we’re trying to do is to approach the problem from a different angle, working from the premise that there are already a lot of instruments out there. It’s unlikely that we’ll be able to provide some kind of hardware that’ll go in and take over from all these, if not billions, then at least millions of instruments. The challenge is: how can we offer as few primitives as possible, from a protocol point of view, that are capable of being integrated into and used by existing hardware? That also means that there are a lot of constraints we’ll have to adhere to that you don’t necessarily have when you’re doing your own hardware.

As Gary @agaricus mentioned above, we have been thinking a lot about how to support existing resource-constrained hardware, even network-connection-constrained hardware, where intermittency in the network connection poses some very interesting limitations on what you can do protocol-wise. We then take those as a starting point to see how we can make at least some minimal progress towards ensuring data integrity and provenance for sensor data and metadata coming off of these instruments.
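One way this could play out under intermittent connectivity, sketched below with the standard library only (all names hypothetical): the instrument buffers hashed observations while offline and, when a connection window opens, anchors a single Merkle root instead of shipping every record.

```python
# Buffer offline, anchor one commitment per connectivity window.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Collapse the buffered record hashes into one small commitment."""
    level = list(leaves) or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Offline: a constrained instrument buffers hashed observations.
buffered = [h(f"obs-{i}".encode()) for i in range(377)]

# Online: one small anchoring message per connection window.
root = merkle_root(buffered)
print("anchor this:", root.hex())
```

Because the same tree yields short inclusion proofs, any single buffered record can later be tied back to that one anchor without re-transmitting the whole batch.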

To a large extent, we come from a more traditional network protocol and distributed systems kind of thinking. This might be a faux pas to say in this company, but for us, what crypto offers in terms of the consensus protocols of blockchains is a set of useful techniques and mechanisms that we can use for solving some of our very specific challenges. In that sense, we might have less of an ideological stance when it comes to crypto, and more of a practical, utilitarian stance. If you know your distributed systems theory, then you also know there are certain problems about coordination and consensus that weren’t really solved before the advent of crypto. And of course, you want to utilize the techniques that are most fit for purpose.

Also, there are some crypto techniques that, at least according to our current understanding and the experiments we’ve been doing so far, are still too immature and too costly in terms of resource use. For instance, zero-knowledge techniques are definitely useful in certain contexts, but it seems like when we are in this domain of resource-constrained instruments, it’s basically too early. Also, when we’re not controlling the hardware, we can’t require things like secure enclaves.

So the question then becomes, how can a sufficient level of certainty around empirical data provenance and integrity be reached? What is both useful and a significant improvement over current practices?

I think there is a much more fundamental point to be made here, which probably is so ingrained in our thinking that we don’t say it often and loudly enough: the coupling between the physical and the digital realm can only be probabilistic. For us, that’s a core epistemological tenet: you cannot make a hard coupling between the physical world and the digital realm, even if you control the hardware all the way down to the bottom. Probabilistic effects will always sneak in, and until we discover new fundamental theories of physics, we have to deal with the probabilistic nature of that coupling and not try to fool ourselves.

Going further, there is the computational boundedness of both us as observers and the machines we can build, which leads me to a comment on the idea of “notarize reality”, which might work as a marketing slogan but which, if taken at face value as a program statement, is epistemologically naïve. To me it has the same ring to it as what we’ve been hearing about self-tracking for many years: you can just “track all the data”, but there’s no such thing as all the data.

Due to our computational boundedness, for all practical purposes, reality appears to us to have infinitely many aspects, which means that every time you build an instrument, every time you set up an experiment, every time you start monitoring something, there is a kind of editorial process going on, too. You have to choose what to measure. You cannot track everything. You cannot “notarize reality” in its totality. It’s off limits. We are limited to phenomena of interest. We define phenomena of interest, then we devise machines to measure those phenomena and have those machines turn those measurements into digital data. That’s how far we can go.

So given the probabilistic nature of the coupling between the physical and the digital realm, and the infinitude of aspects of reality, we need a different kind of approach than epistemologically naïve slogans. How can we increase the certainty around the answers we get from empirical data? And how can we increase the certainty that these questions might be answered far down the line in the future? In principle, in a hundred years, you should be able to ask questions of these data, both of the substance of the observational data and of the metadata, and then hopefully be able to cross the threshold of being able to trust the data for your reasoning purpose.


To build trust at scale, it’s important to be clear on what choices you’re making during data collection, and more specifically, what you’re not including and what questions you can’t answer with data. Read a stellar essay from C. Thi Nguyen on this point this morning, which I like to think of as the “how” to @davidtlang’s “why” in Standards Make The World.

Here are some relevant bits:

From a policy perspective, anything hard to measure can start to fade from sight. An optimist might hope to get around these problems with better data and metrics. What I want to show here is that these limitations on data are no accident. The basic methodology of data—as collected by real-world institutions obeying real-world forces of economy and scale—systematically leaves out certain kinds of information […] And these limitations aren’t accidents or bad policies. They are built into the core of what data is. Data is supposed to be consistent and stable across contexts. The methodology of data requires leaving out some of our more sensitive and dynamic ways of understanding the world in order to achieve that stability. These limitations are particularly worrisome when we’re thinking about success—about targets, goals, and outcomes. When actions must be justified in the language of data, then the limitations inherent in data collection become limitations on human values.

Classification systems decide, ahead of time, what to remember and what to forget. But these categories aren’t neutral. All classification systems are the result of political and social processes, which involve decisions about what’s worth remembering and what we can afford to forget.

Public transparency requires that the reasoning and actions of institutional actors be evaluated by the public, using metrics comprehensible to the public. But this binds expert reasoning to what the public can understand, thus undermining their expertise. This is particularly problematic in cases where the evaluation of success depends on some specialized understanding. The demand for public transparency tends to wash deep expertise out of the system.

Think those questions are worth tackling.

Another thing that came to mind when reading your proposal: Heather Krause’s Data Biography. Think this would be especially useful for capturing (and normalizing) changes in metrics.


Thank you for this very relevant and useful comment. I wasn’t aware of either Heather Krause or C. Thi Nguyen’s work. I read the material at those links and a little more and had a chance to do some thinking. I very much appreciate these pointers, they are super on target. Putting some notes here in case you are interested enough to follow and reply, but not making that assumption, this does get into some technical/philosophical weeds.

C. Thi Nguyen writes:

Data must be something that can be collected by and exchanged between different people in all kinds of contexts, with all kinds of backgrounds. Data is portable, which is exactly what makes it powerful. But that portability has a hidden price: to transform our understanding and observations into data, we must perform an act of decontextualization.

Notice that in this description the concept of “portability” is described in binary terms. There are observations that are contextualized and situated, and legible only to yourself or those who share your immediate context (these observations are “not data”), and observations that can be rendered legible from a distance, but at the risk of oversimplification, falsification, irrelevance, and incoherence (“data”).

Let’s label the problems of oversimplification, falsification, irrelevance, and incoherence as problems of trust. And let’s follow C. Thi Nguyen in remembering that trust problems are not problems from a God’s Eye View, they don’t float out there as a parameter defined in eternity, but are problems experienced by the reasoner who wants to rely on the observational records.

Our insight is that the scaling vs. trust problem can be modeled continuously. This approach was basically forced on us because we faced exactly the kinds of problems discussed by both C. Thi Nguyen and Heather Krause.

To get a taste for this way of thinking about things, imagine reflecting on your own experience in the most contextualized, immediate way. For instance, following C. Thi Nguyen’s food-related examples, imagine asking, as you take a bite: “do I like this?” Of course it would be idiotic to try to answer this question with data. But then again, there might be situations in which you want to examine your experience over time and keep track of your reflections: for instance, recovering from a COVID-related loss of smell, and wondering whether your treatment is continuing to help you improve or whether the effects have leveled off. This change over time can be hard to notice, so you make some notes. Now you are in a situation that, on the tiniest scale, is analogous to Heather Krause’s curiosity about whether the written records she’s looking at are trustworthy.

Let’s scale up again. We know there is a replication crisis. This is usually sketched as a crisis of trust due to a mix of fraud, incompetence, and inadequate support and infrastructure for data sharing and peer-review in science. But ask any scientist you know involved in empirical research about replicating their own experiments. (You might have to build some trust with them first.) It turns out that the replication crisis scales back down to individual labs, and individual experiments, into the finest filaments of troubleshooting empirical research.

Simulated internal dialog:

“Wait, this data looks wrong, is it possible my research assistant put the sensors on in a vertical arrangement instead of horizontally as the documentation says? That would be bad.”

“Wait wait wait, how many other times have they done this?”

To get one more sense of moving up and down the scale, I think the long version of Heather Krause’s We All Count data biography is also useful. What you want is for the information on this form to follow the survey or other measurement data all the way through all the steps of the aggregation process. Otherwise, it will be as wrong as the context it purports to capture. What’s needed to truly accomplish this? We were led to trying to develop a low-level protocol for securing provenance of observations by thinking about how to create probabilistic hooks that would work all the way down, at reasonable (or at least predictable) computational cost.
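To suggest what such a hook could look like at one step of aggregation, here is a minimal sketch (standard library only, names hypothetical): the published summary statistic carries a commitment to the exact input records, so the data-biography question “which observations produced this number?” stays answerable downstream, at a cost that grows predictably with the number of inputs.

```python
# A provenance hook that travels with an aggregate.
import hashlib, json, statistics

def digest(record: dict) -> str:
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def aggregate(records: list) -> dict:
    values = [r["value"] for r in records]
    return {
        "mean": statistics.mean(values),
        "n": len(values),
        # The hook: a commitment to the inputs, not the inputs themselves.
        "inputs": hashlib.sha256("".join(sorted(digest(r) for r in records)).encode()).hexdigest(),
    }

records = [{"instrument": "thermo-01", "t": i, "value": 20.0 + 0.1 * i} for i in range(10)]
summary = aggregate(records)

# An auditor holding candidate source records can check whether they are
# exactly the ones behind the published mean.
assert summary["inputs"] == aggregate(records)["inputs"]
print(summary)
```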

Finally, I wanted to note one place where I found myself disagreeing strongly with C. Thi Nguyen:

“The power of data is that it is collectible by many people and formatted to travel and aggregate.”

This is descriptively correct about one aspect of the problem, but it is misleading when it mixes up “many people” with “travel and aggregate.” Whenever there is measurement of any kind there is traveling. And as soon as measurements are recorded, aggregation begins. From this very first point, you have reasons to doubt. The assumption that this is only happening at the group level is wrong.

If you’ve read this far, please feel free to ask questions or make additional comments.

Our engagement with the SoP discussion while developing this year’s application has both confirmed and challenged our framework for thinking about how a protocol approach could address some of the most pressing issues in the infrastructure for empirical discovery, esp. in science. As the application period comes to a close we wanted to sum this up, both for our own benefit and to provide further information and feedback to the organizers. @timbeiko’s questions about the internal/external tradeoffs, complexity/composability, and lifecycles of blockchain protocols were among the most provocative for us, and we just spent a session with our close collaborator Jakob Eg Larsen exploring the implications.

This made us think about a missing word in one of the most important sentences in blockchain discourse: “The solution we propose begins with a timestamp server.” The missing word is logical. The timestamps in the Bitcoin network are logical timestamps. They determine the relative temporal sequence of events inside the representational system, but there is no definition of calibration to real world time. Of course this calibration nonetheless occurs. The Bitcoin network is linked to real world events through Proof-of-Work. Computation is being done on real machines, which themselves have calibration to real world times, and the system displays a Unix epoch time format with second precision. That’s good enough, because the only thing that matters here is getting the sequence right.
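For readers who have not run into the distinction before, here is a minimal sketch (plain Python, all numbers hypothetical) of the gap between a logical clock and a calibrated one: a Lamport-style counter preserves the ordering of events, but its values say nothing about when they happened in the physical world; that has to be asserted separately, with explicit uncertainty bounds.

```python
# Logical ordering versus physical-time calibration.
import time

class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self) -> int:                  # a local event
        self.counter += 1
        return self.counter

    def receive(self, remote: int) -> int:  # a message from another node
        self.counter = max(self.counter, remote) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t1 = a.tick()
t2 = b.receive(t1)
assert t2 > t1   # the ordering is preserved, but neither value says *when* this happened

# Calibration is a separate claim, stated with bounds rather than false precision.
calibration = {
    "logical": t2,
    "physical_lower_s": time.time() - 2.0,  # conservative allowance for skew and latency
    "physical_upper_s": time.time() + 2.0,
}
print(calibration)
```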

You can see this solution as a version of bootstrapping, in both the positive and negative sense. Positive, because the Bitcoin network manages to put a timestamp server at the core of its architecture without ever having to troubleshoot the hookup to real world time; and negative, because bootstrapping, as a general concept, should always be used ironically: in real life it doesn’t mean “you don’t have to pay the entropic toll” but rather “you can hide the toll, accrue entropic debt, and maybe get others to pay it.” The entropic debt incurred by the Bitcoin network is paid through the only channel where it touches the real world; that is, through the real world computational process of Proof-of-Work. (There are some details here worth exploring; we think we have a sense of how the accounting works but would probably learn from others, and in any case this post is already getting too long!)

Proof-of-Stake systems are a different matter. We’re aware of some studies of Ethereum timestamps but we would not presume to instruct the experts on the optimal approach. However, we do notice that even in academic research the problem of real world calibration tends to be neglected in favor of logical timestamping. Since microsecond advantages have been leveraged in real world transactions for over a decade, understanding calibration of logical clocks to real world clocks may be important. In any case, the learning we’ve done here so far has also made us wonder whether ideas for securing provenance of empirical observations could be relevant to some of the complexity/lifecycle discussions around blockchains generally. The question of how blockchains touch the real world is a subset of the larger question of how explicit, codified, and generally accepted instructions (i.e. protocols) are instantiated. How tight can the grip between instructions and actions be? How tight should it be?

In our domain real world calibration of time stamps is indispensable. We must have it. That’s because a key benefit of secure digital provenance of empirical observations is that you get to align multiple time-series on a single timeline. If you want to find out about two different phenomena measured by two different instruments (or even about one phenomenon measured in two different ways) the time series must line up. If you are dealing with phenomena like physical tremor you will be at 5-10 Hz. If you are dealing with ECG you might be over 100 Hz. I don’t know what the finest temporal measurements at the Large Hadron Collider are today, but they are certainly impossible to notarize using current tools. But the end of the scale represented by the first two should be within reach, and we should at least be able to calculate the cost of going further, maybe even all the way out. Where our interests align with SoP topics, perhaps it’s enough to say that awareness of these time calibration issues should help people be a little less naive about the ways in which digital protocols touch the real world.
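A small sketch of why this matters in practice, assuming numpy and entirely synthetic data: a roughly 10 Hz tremor series and a 128 Hz ECG series can be resampled onto a shared timeline only if their timestamps refer to the same physical clock, and a clock offset of just 100 ms is already about 13 ECG samples of misalignment.

```python
# Aligning two synthetic time series on a single calibrated timeline.
import numpy as np

rng = np.random.default_rng(0)

# Tremor sensor at 10 Hz, ECG at 128 Hz, both timestamped in epoch seconds.
t_tremor = np.arange(0, 10, 1 / 10)
tremor = np.sin(2 * np.pi * 6 * t_tremor) + 0.1 * rng.standard_normal(t_tremor.size)
t_ecg = np.arange(0, 10, 1 / 128)
ecg = np.sin(2 * np.pi * 1.2 * t_ecg)

# Resample both onto a shared 10 Hz timeline by interpolation.
t_common = np.arange(0, 10, 1 / 10)
tremor_common = np.interp(t_common, t_tremor, tremor)
ecg_common = np.interp(t_common, t_ecg, ecg)

# A 100 ms clock offset between the instruments silently compares different
# moments in the physical world.
ecg_skewed = np.interp(t_common, t_ecg + 0.1, ecg)
print("max alignment error from a 100 ms skew:",
      float(np.max(np.abs(ecg_common - ecg_skewed))))
```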

Below we formulate some additional timestamp questions that would be interesting to explore during our research this summer, if selected:

  1. How can the network reasonably agree about useful bounds on physical time? What is “good enough” to catalyze a protocol transition in empirical research? (A minimal sketch of what such bounds could look like follows this list.)
  2. What trade-offs are involved in decentralizing a timestamp server that makes assertions about real world time, not just logical time?
  3. How can we characterize and explain the threat model surrounding the coupling between physical time and logical time, as it may affect existing blockchain protocols and, more importantly, as it may condition our ideas for improving trust protocols in empirical research?
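On question 1, one shape the answer could take is sketched below (standard library only; the anchors, uncertainties, and identifiers are all hypothetical): instead of asserting an exact physical time, an observation’s commitment is sandwiched between two anchoring events whose physical times are known within stated uncertainties, which yields defensible bounds rather than false precision.

```python
# Bounding when an observation entered the digital realm by sandwiching it
# between two anchored events with known (but uncertain) physical times.
import hashlib

def commitment(observation: bytes, freshness_beacon: str) -> str:
    # Including a prior anchor proves the record was made *after* it; getting
    # the commitment included under the next anchor proves it was made *before* that.
    return hashlib.sha256(freshness_beacon.encode() + observation).hexdigest()

anchor_before = {"id": "anchor-1041", "physical_s": 1_700_000_000.0, "uncertainty_s": 1.5}
anchor_after  = {"id": "anchor-1042", "physical_s": 1_700_000_600.0, "uncertainty_s": 1.5}

obs = b"tremor=5.8Hz amplitude=0.3"
c = commitment(obs, anchor_before["id"])   # later included under anchor_after

lower = anchor_before["physical_s"] - anchor_before["uncertainty_s"]
upper = anchor_after["physical_s"] + anchor_after["uncertainty_s"]
print(f"observation {c[:12]} entered the digital realm between {lower} and {upper} (epoch seconds)")
```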

This has been an interesting and productive encounter for us, and we hope we get a chance to continue.
