In Data We Trust

Title

In Data We Trust

Team member names

Gary Isaac Wolf
Thomas Blomseth Christiansen

Short Summary of Your Improvement Idea

We propose improving the routines for securing trust in empirical data with a protocol that can ensure provenance, integrity, and addressability for observational records at an arbitrary level of resolution, while keeping computational cost reasonable for most real-world use cases. We will use our PIG summer to build a working model of this idea and expose it for people to use and examine. The model will put into practice new protocol primitives that make it tractable to automate data trust routines, supporting verifiable provenance and detection of tampered data while establishing guarantees about the temporal boundaries around when each empirical observation makes the transition from the physical to the digital realm.

What is the existing target protocol you are hoping to improve or enhance?

Routines for securing trust in empirical data.

What is the core idea or insight about potential improvement you want to pursue?

The growth of empirical research, and the broader use of real-world sensor data, renders handcrafted trust routines obsolete. Applying a relatively simple protocol at the atoms-bits border permits automation of data provenance and forensics routines downstream, with the potential to dramatically upgrade social reasoning processes.

What is your discovery methodology for investigating the current state of the target protocol?

The lab is our field. We work with researchers building instruments themselves for data acquisition, and we have experiential knowledge of common data trust failures (incomplete metadata, data fumbling, rogue behavior) that we will simulate in our model. Additionally, we might critically review some monstrous artifacts of standards-based approaches whose nightmarish absorption of futile attention provides some of our emotional fuel.

In what form will you prototype your improvement idea?

We’re going for running code that people can inspect and play with. This includes a model simulator and a functioning breadboard crypto instrument talking to a one-node Cosmos appchain.

How will you field-test your improvement idea?

We’ll test the idea by building a model world in which there are agents that generate data and agents that cause trouble by fumbling the data in recognizable ways (bad, bad agents), along with user-operated controls supporting forensics, so that people can test our method for securing fine-grained provenance. The model will inform discussions with our collaborators about how to implement the protocol in their instrumentation and benefit from it in downstream processes in their labs.
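To make this concrete, here is a minimal sketch of the kind of toy world we have in mind, in plain Python with only the standard library. All names are hypothetical, and an HMAC with a per-instrument secret stands in for whatever device signature scheme the real protocol would use: honest agents emit hash-chained, authenticated records, a bad agent edits one after the fact, and a forensic pass localizes the damage.

```python
# Toy model of the field-test world described above. Standard library only;
# HMAC with a shared per-instrument secret stands in for real device signatures.
import hashlib, hmac, json

def record(instrument_key: bytes, prev_hash: str, payload: dict) -> dict:
    """Honest agent: emit a hash-chained, authenticated observation."""
    body = {"prev": prev_hash, "payload": payload}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    tag = hmac.new(instrument_key, digest.encode(), "sha256").hexdigest()
    return {**body, "hash": digest, "tag": tag}

def tamper(chain: list, index: int) -> None:
    """Bad agent: silently edit one observation after the fact."""
    chain[index]["payload"]["value"] += 1.0

def audit(chain: list, instrument_key: bytes) -> list:
    """Forensics: report the indices of records that no longer verify."""
    bad, prev = [], "genesis"
    for i, rec in enumerate(chain):
        body = {"prev": rec["prev"], "payload": rec["payload"]}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        expected = hmac.new(instrument_key, digest.encode(), "sha256").hexdigest()
        if not (rec["prev"] == prev and digest == rec["hash"]
                and hmac.compare_digest(rec["tag"], expected)):
            bad.append(i)
        prev = rec["hash"]
    return bad

key, prev, chain = b"device-secret", "genesis", []
for t in range(5):
    rec = record(key, prev, {"t": t, "value": 20.0 + t})
    chain.append(rec)
    prev = rec["hash"]

tamper(chain, 2)
print("records failing verification:", audit(chain, key))  # -> [2]
```

The point of the exercise is that the forensic step becomes a mechanical check rather than a reconstruction from memory and lab notebooks.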

Who will be able to judge the quality of your output?

Albert Wenger
Primavera De Filippi
Chris Dixon
Brian Nosek

How will you publish and evangelize your improvement idea?

  1. Longish essay about provenance trouble that could be fairly entertaining, because the examples of what happens when provenance fails have something of an America’s Funniest Home Videos quality to them. (You just knew something terrible was going to happen when you saw the skateboard and the slide.)

  2. Computational notebook providing access to our implementation.

What is the success vision for your idea?

A future where decisions about the trajectory of a scientific inquiry, the efficiency of a production process, or the direction of government policy are made in the light of verifiable, trustworthy empirical data.


Feedback question

Note: In the time since posting this question, I believe we’ve answered it for ourselves through reading others’ posts: it appears that our protocol definition is on the far end of specifically recognizable in the real world, so we think we’ve confirmed we’re in scope. We’re leaving this reply up for anybody who wants more background, and of course we’re open to correction, but we didn’t want to pretend we still had the same level of uncertainty as when we originally posted.


We wanted to obey the ~500 word requirement, but we have a specific feedback question that we wanted to ask, and perhaps it will provoke some ideas or questions.

What we call data trust routines take different forms in different domains. Provenance issues are common when dealing with sensor data, survey and polling data, clinical research data, and measurement of industrial processes. The domain-specific trust routines go by a variety of names, but they follow roughly the same steps. There is a planning routine involving adoption and use of standards and commonly accepted instruments; a data collection routine involving process checks and quality sampling; a documentation routine involving arrangement of and access to records; and a forensic routine involving inspection, troubleshooting, and assurances to third parties.

These routines operate on the border between real world mess and formal representation. In the real world headers get lost, people commit intentional fraud, and nobody remembers what version of what instrument was originally used. Maintaining the tie between the world and its representation is time-consuming and expensive. For instance, in clinical research there’s an activity called “source data verification” that has been measured as an industrial input, with estimates as high as 30% of research cost. While this aggregate figure is itself subject to the same source data worries it purports to quantify, every empirical researcher knows from experience that the cost of building trust in data is real and significant. Narrative examples of the problem include headlines like: “Stanford president to resign after investigation finds he failed to ‘decisively and forthrightly’ correct research.”

We realize that in our definition of the protocol we’d like to improve we’re not describing a single set of concrete operations. There is a lot of diversity here. But all these trust routines have common features that allow us to conceive of a universal upgrade across the entire set. While the routines across these domains may not be quite recognizable as a single protocol, we want to demonstrate our idea for a missing primitive that can protocolize them. Does this fly with you? Do you accept our protocol framework (planning, data collection, documentation, forensics) as a good definition of a protocol to be improved?

We’re asking for this feedback because we don’t necessarily have to go with this degree of generality on the problem side. Our idea for improvement originates in the domain of electromechanical measurements related to health, where we’ve been working for a long time and have experience building our own hardware and addressing concerns about data aggregation and traceability. If we stuck to this, the relevant current protocols would be much easier to describe concretely. BUT the protocol improvement we have in mind is more general and we’ve gotten pretty interested in thinking about it this way. So: is this starting point OK? Or do you think a more concrete, domain-centric starting point would be more in sync with this year’s design?

Regarding

I am reminded of Faust.


This link (thank you!) prompted an interesting short discussion this morning during CET/PST overlap hours. Roc proposes a new kind of hardware that speaks crypto from the ground up. We 100% support this idea. It must happen.

OK, so: can it happen?

And: What can we do to make it more likely to happen?

Ted Nelson described universal transclusion in 1980. Today, if you look at tutorials on Roam or Obsidian, you will see people marveling at (and struggling with) merely local transclusion. We’re forty years down the road! Why? Because, as anybody on this forum probably understands, full transclusion requires a world computer. Unless you can think of another way to do it? And as we know, the concept of a world computer didn’t even show up until 2013. Something important was missing. Nelson not only could not have invented it, he didn’t even know it was needed.

In our field, we hear all the time about “moonshots.” When I hear it, I try to force my face into a blank expression, because I typically like the person who is using it and want to be supportive and hear them out. But it’s painful, because it is used to mean: “if we know what we want to do, state it clearly, supply enough resources, and work with courage and faith, we can count on reaching our goal.” This is not true, and should not be the lesson anybody takes from Apollo. The important thing is to develop a sensitive awareness of where you are in the story; meaning, what are the real constraints that have to be overcome in order to realize what’s obvious conceptually. Do we really know where we are in the story?

A few years ago I organized a Quantified Self/Hyperledger MeetUp in Amsterdam and we got into a good conversation about protocols for empirical research, based on Thomas’ talk about time-stamps and time-series data. It was easy to get to the end of the story in our imaginations: secure, traceable real world data in blockspace.

But the intermediate steps: wow. Could you get even a rough sense of how many there were? Even today, it is very easy to say “zero-knowledge proof” but go ahead and try to prototype this so that it works for a network of very small sensors that may be turning on and off only briefly to make an intermittent data transfer connection, while otherwise preserving battery life. Oh, and also please be practically useful to grad students responsible for data collection on shoestring budgets. This is just one quick sketch, there are many other, less instantly comprehensible constraints.

The hand-wavy solution treats data as a blob. Notarize the blob as soon as it arrives at a convenient transit point and your work is done. There are a number of reasons why this isn’t good enough. No time to go through this in detail (hopefully we’ll have the summer to learn to be clearer), but, to do the problem in your head, just ask: What counts as an instrument? This question is harder than it looks, especially if you don’t give in to the temptation to reason in a circular fashion. (Circular: We’ll provide certain kinds of instruments with IDs and define instruments as only those kinds of things to which we can give IDs.) When you start in the actually existing world of real instruments, today’s instruments, and ask: how can we issue secure universal IDs to each instrument, however it is defined, and notarize each observation, at any temporal resolution, then you get to a really fun domain of questions.
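As a rough illustration of the alternative to blob notarization, here is a minimal sketch (Python standard library only, all identifiers hypothetical) in which an instrument gets a stable ID derived from its own identity material and folds every observation into a running hash chain, so that only the chain head ever needs to be anchored externally.

```python
# Per-observation commitments instead of notarizing a whole dataset as a blob.
# The "public key" below is a stand-in for whatever identity material a real
# instrument could carry; nothing here is a finished design.
import hashlib, os, time

def instrument_id(pubkey: bytes) -> str:
    """A universal instrument ID derived from the device's own identity material,
    rather than assigned by a registry that pre-decides what counts as an instrument."""
    return hashlib.sha256(b"instrument-id/v0" + pubkey).hexdigest()[:16]

def commit(prev_head: str, iid: str, t_ns: int, sample: float) -> str:
    """Fold one observation into the instrument's running hash chain."""
    leaf = f"{iid}|{t_ns}|{sample!r}".encode()
    return hashlib.sha256(prev_head.encode() + hashlib.sha256(leaf).digest()).hexdigest()

pubkey = os.urandom(32)            # stand-in for a device key pair
iid = instrument_id(pubkey)
head = "genesis"
for _ in range(1000):              # e.g. a 10 Hz sensor running for 100 seconds
    head = commit(head, iid, time.time_ns(), 20.0)

# Only `head` needs to be notarized externally. Re-deriving the same head from
# the stored observations later shows that none were altered, inserted, or
# dropped since the anchor was made, at whatever temporal resolution the device used.
print(iid, head)
```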

Of course this fun comes with a price: you start to realize there are a number of very different and very hard problems between where we are today and a full stack solution, many of which you cannot solve yourself, even if you’re in a good mood and willing to work hard. That means you depend crucially on what other people do, and you often don’t know for sure how well their part is working out.

One day I was holding my head in my hands after hearing the word “moonshot” one too many times in the same week and Thomas said: “What if we think we’re NASA in the sixties but we’re really more like Robert Goddard in the twenties?” Brutal, but I laughed. Our way of coping with this uncertainty about where we are in the story is to try to create a protocol for some fundamental services needed right now, but to create them in a way that will survive technical upheaval. The landmark example is TCP. With that simultaneously modest and megalomaniacal goal in mind we’re asking questions like: What is an instrument? What is an observational record? What is traceability? Maybe if we can get the primitives right, they’ll even get into Roc.


Yes, thank you for the pointer to Roc, @n_a. Interesting to compare and contrast their approach to how we are thinking about it. At this stage, they’re very light in detail in terms of what they’re really doing—it’s all very stealthy—but it seems like their architecture is full stack going all the way from hardware up through software.

Looking at their approach, I’d say they are working in a different region of the trade-off space than we are, which will make certain things easier and other things harder. Hardware for sure is hard, but in a certain sense, if you’re opting for having your own hardware all the way down to where it’s touching reality, you’re also making some things easier for yourself. And then the trickiness moves into other dimensions. I think adoption will be a challenge for Roc.

What we’re trying to do is to approach the problem from a different angle, working from the premise that there are already a lot of instruments out there. It’s unlikely that we’ll be able to provide some kind of hardware that’ll go in and take over from all these, if not billions, then at least millions of instruments. The challenge is: how can we offer as few primitives as possible, from a protocol point of view, that are capable of being integrated into and used by existing hardware? That also means that there are a lot of constraints we’ll have to adhere to that you don’t necessarily have when you’re doing your own hardware.

As Gary @agaricus mentioned above, we have been thinking a lot about how to support existing resource-constrained hardware, even network-connection-constrained hardware, where intermittency in the network connection poses some very interesting limitations on what you can do protocol-wise. We then take those as a starting point to see how we can make at least some minimal progress towards ensuring data integrity and provenance for sensor data and metadata coming off of these instruments.
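One way this could play out under intermittent connectivity, sketched below with the standard library only (all names hypothetical): the instrument buffers hashed observations while offline and, when a connection window opens, anchors a single Merkle root instead of shipping every record.

```python
# Buffer offline, anchor one commitment per connectivity window.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Collapse the buffered record hashes into one small commitment."""
    level = list(leaves) or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Offline: a constrained instrument buffers hashed observations.
buffered = [h(f"obs-{i}".encode()) for i in range(377)]

# Online: one small anchoring message per connection window.
root = merkle_root(buffered)
print("anchor this:", root.hex())
```

Because the same tree yields short inclusion proofs, any single buffered record can later be tied back to that one anchor without re-transmitting the whole batch.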

To a large extent, we come from a more traditional network protocol and distributed systems kind of thinking. This might be a faux pas to say in this company, but for us, what crypto offers in terms of the consensus protocols of blockchains is a set of useful techniques and mechanisms that we can use for solving some of our very specific challenges. In that sense, we might have less of an ideological stance when it comes to crypto, and more of a practical, utilitarian stance. If you know your distributed systems theory, then you also know there are certain problems about coordination and consensus that weren’t really solved before the advent of crypto. And of course, you want to utilize the techniques that are most fit for purpose.

Also, there are some crypto techniques that, at least according to our current understanding and the experiments we’ve been doing so far, are still too immature and too costly in terms of resource use. For instance, zero-knowledge techniques are definitely useful in certain contexts, but it seems like when we are in this domain of resource-constrained instruments, it’s basically too early. Also, when we’re not controlling the hardware, we can’t require things like secure enclaves.

So the question then becomes, how can a sufficient level of certainty around empirical data provenance and integrity be reached? What is both useful and a significant improvement over current practices?

I think there is a much more fundamental point to be made here, which probably is so ingrained in our thinking that we don’t say it often and loudly enough: the coupling between the physical and the digital realm can only be probabilistic. For us, that’s a core epistemological tenet: you cannot make a hard coupling between the physical world and the digital realm, even if you control the hardware all the way down to the bottom. Probabilistic effects will always sneak in, and until we discover new fundamental theories of physics, we have to deal with the probabilistic nature of that coupling and not try to fool ourselves.

Going further, there is the computational boundedness of both us as observers and the machines we can build, which leads me to a comment on the idea of “notarize reality”, which might work as a marketing slogan but which, if taken at face value as a program statement, is epistemologically naïve. To me it has the same ring to it as what we’ve been hearing about self-tracking for many years: you can just “track all the data”, but there’s no such thing as all the data.

Due to our computational boundedness, for all practical purposes, reality appears to us to have infinitely many aspects, which means that every time you build an instrument, every time you set up an experiment, every time you start monitoring something, there is a kind of editorial process going on, too. You have to choose what to measure. You cannot track everything. You cannot “notarize reality” in its totality. It’s off limits. We are limited to phenomena of interest. We define phenomena of interest, then we devise machines to measure those phenomena and have those machines turn those measurements into digital data. That’s how far we can go.

So given the probabilistic nature of the coupling between the physical and the digital realm, and the infinitude of aspects of reality, we need a different kind of approach than epistemologically naïve slogans. How can we increase the certainty around the answers we get from empirical data? And how can we increase the certainty that these questions might be answered far down the line in the future? In principle, in a hundred years, you should be able to ask questions of these data, both of the substance of the observational data and of the metadata, and then hopefully be able to cross the threshold of being able to trust the data for your reasoning purpose.


To build trust at scale, it’s important to be clear on what choices you’re making during data collection, and more specifically, what you’re not including and what questions you can’t answer with data. Read a stellar essay from C. Thi Nguyen on this point this morning, which I like to think of as the “how” to @davidtlang’s “why” in Standards Make The World.

Here are some relevant bits:

From a policy perspective, anything hard to measure can start to fade from sight. An optimist might hope to get around these problems with better data and metrics. What I want to show here is that these limitations on data are no accident. The basic methodology of data—as collected by real-world institutions obeying real-world forces of economy and scale—systematically leaves out certain kinds of information […] And these limitations aren’t accidents or bad policies. They are built into the core of what data is. Data is supposed to be consistent and stable across contexts. The methodology of data requires leaving out some of our more sensitive and dynamic ways of understanding the world in order to achieve that stability. These limitations are particularly worrisome when we’re thinking about success—about targets, goals, and outcomes. When actions must be justified in the language of data, then the limitations inherent in data collection become limitations on human values.

Classification systems decide, ahead of time, what to remember and what to forget. But these categories aren’t neutral. All classification systems are the result of political and social processes, which involve decisions about what’s worth remembering and what we can afford to forget.

Public transparency requires that the reasoning and actions of institutional actors be evaluated by the public, using metrics comprehensible to the public. But this binds expert reasoning to what the public can understand, thus undermining their expertise. This is particularly problematic in cases where the evaluation of success depends on some specialized understanding. The demand for public transparency tends to wash deep expertise out of the system.

Think those questions are worth tackling.

Another thing that came to mind when reading your proposal: Heather Krause’s Data Biography. Think this would be especially useful for capturing (and normalizing) changes in metrics.


Thank you for this very relevant and useful comment. I wasn’t aware of either Heather Krause or C. Thi Nguyen’s work. I read the material at those links and a little more and had a chance to do some thinking. I very much appreciate these pointers, they are super on target. Putting some notes here in case you are interested enough to follow and reply, but not making that assumption, this does get into some technical/philosophical weeds.

C. Thi Nguyen writes:

Data must be something that can be collected by and exchanged between different people in all kinds of contexts, with all kinds of backgrounds. Data is portable, which is exactly what makes it powerful. But that portability has a hidden price: to transform our understanding and observations into data, we must perform an act of decontextualization.

Notice that in this description the concept of “portability” is described in binary terms. There are observations that are contextualized and situated, and legible only to yourself or those who share your immediate context (these observations are “not data”), and observations that can be rendered legible from a distance, but at the risk of oversimplification, falsification, irrelevance, and incoherence (“data”).

Let’s label the problems of oversimplification, falsification, irrelevance, and incoherence as problems of trust. And let’s follow C. Thi Nguyen in remembering that trust problems are not problems from a God’s Eye View, they don’t float out there as a parameter defined in eternity, but are problems experienced by the reasoner who wants to rely on the observational records.

Our insight is that the scaling vs. trust problem can be modeled continuously. This approach was basically forced on us because we faced exactly the kinds of problems discussed by both C. Thi Nguyen and Heather Krause.

To get a taste for this way of thinking about things, imagine reflecting on your own experience in the most contextualized, immediate way. For instance, following C. Thi Nguyen’s food-related examples, imagine asking, as you take a bite: “do I like this?” Of course it would be idiotic to try to answer this question with data. But then again, there might be situations in which you want to examine your experience over time and keep track of your reflections: for instance, recovering from a COVID-related loss of smell, and wondering whether your treatment is continuing to help you improve or whether the effects have leveled off. This change over time can be hard to notice, so you make some notes. Now you are in a situation that, on the tiniest scale, is analogous to Heather Krause’s curiosity about whether the written records she’s looking at are trustworthy.

Let’s scale up again. We know there is a replication crisis. This is usually sketched as a crisis of trust due to a mix of fraud, incompetence, and inadequate support and infrastructure for data sharing and peer-review in science. But ask any scientist you know involved in empirical research about replicating their own experiments. (You might have to build some trust with them first.) It turns out that the replication crisis scales back down to individual labs, and individual experiments, into the finest filaments of troubleshooting empirical research.

Simulated internal dialog:

“Wait, this data looks wrong, is it possible my research assistant put the sensors on in a vertical arrangement instead of horizontally as the documentation says? That would be bad.”

“Wait wait wait, how many other times have they done this?”

To get one more sense of moving up and down the scale, I think the long version of Heather Krause’s We All Count data biography is also useful. What you want is for the information on this form to follow the survey or other measurement data all the way through all the steps of the aggregation process. Otherwise, it will be as wrong as the context it purports to capture. What’s needed to truly accomplish this? We were led to trying to develop a low-level protocol for securing provenance of observations by thinking about how to create probabilistic hooks that would work all the way down, at reasonable (or at least predictable) computational cost.
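To suggest what such a hook could look like at one step of aggregation, here is a minimal sketch (standard library only, names hypothetical): the published summary statistic carries a commitment to the exact input records, so the data-biography question “which observations produced this number?” stays answerable downstream, at a cost that grows predictably with the number of inputs.

```python
# A provenance hook that travels with an aggregate.
import hashlib, json, statistics

def digest(record: dict) -> str:
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def aggregate(records: list) -> dict:
    values = [r["value"] for r in records]
    return {
        "mean": statistics.mean(values),
        "n": len(values),
        # The hook: a commitment to the inputs, not the inputs themselves.
        "inputs": hashlib.sha256("".join(sorted(digest(r) for r in records)).encode()).hexdigest(),
    }

records = [{"instrument": "thermo-01", "t": i, "value": 20.0 + 0.1 * i} for i in range(10)]
summary = aggregate(records)

# An auditor holding candidate source records can check whether they are
# exactly the ones behind the published mean.
assert summary["inputs"] == aggregate(records)["inputs"]
print(summary)
```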

Finally, I wanted to note one place where I found myself disagreeing strongly with C. Thi Nguyen:

“The power of data is that it is collectible by many people and formatted to travel and aggregate.”

This is descriptively correct about one aspect of the problem, but it is misleading when it mixes up “many people” with “travel and aggregate.” Whenever there is measurement of any kind there is traveling. And as soon as measurements are recorded, aggregation begins. From this very first point, you have reasons to doubt. The assumption that this is only happening at the group level is wrong.

If you’ve read this far, please feel free to ask questions or make additional comments.

Our engagement with the SoP discussion while developing this year’s application has both confirmed and challenged our framework for thinking about how a protocol approach could address some of the most pressing issues in the infrastructure for empirical discovery, esp. in science. As the application period comes to a close we wanted to sum this up, both for our own benefit and to provide further information and feedback to the organizers. @timbeiko’s questions about the internal/external tradeoffs, complexity/composability, and lifecycles of blockchain protocols were among the most provocative for us, and we just spent a session with our close collaborator Jakob Eg Larsen exploring the implications.

This made us think about a missing word in one of the most important sentences in blockchain discourse: “The solution we propose begins with a timestamp server.” The missing word is logical. The timestamps in the Bitcoin network are logical timestamps. They determine the relative temporal sequence of events inside the representational system, but there is no definition of calibration to real world time. Of course this calibration nonetheless occurs. The Bitcoin network is linked to real world events through Proof-of-Work. Computation is being done on real machines, which themselves have calibration to real world times, and the system displays a Unix epoch time format with second precision. That’s good enough, because the only thing that matters here is getting the sequence right.
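For readers who have not run into the distinction before, here is a minimal sketch (plain Python, all numbers hypothetical) of the gap between a logical clock and a calibrated one: a Lamport-style counter preserves the ordering of events, but its values say nothing about when they happened in the physical world; that has to be asserted separately, with explicit uncertainty bounds.

```python
# Logical ordering versus physical-time calibration.
import time

class LamportClock:
    def __init__(self):
        self.counter = 0

    def tick(self) -> int:                  # a local event
        self.counter += 1
        return self.counter

    def receive(self, remote: int) -> int:  # a message from another node
        self.counter = max(self.counter, remote) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t1 = a.tick()
t2 = b.receive(t1)
assert t2 > t1   # the ordering is preserved, but neither value says *when* this happened

# Calibration is a separate claim, stated with bounds rather than false precision.
calibration = {
    "logical": t2,
    "physical_lower_s": time.time() - 2.0,  # conservative allowance for skew and latency
    "physical_upper_s": time.time() + 2.0,
}
print(calibration)
```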

You can see this solution as a version of bootstrapping, in both the positive and negative sense. Positive, because the Bitcoin network manages to put a timestamp server at the core of its architecture without ever having to troubleshoot the hookup to real world time; and negative, because bootstrapping, as a general concept, should always be used ironically: in real life it doesn’t mean “you don’t have to pay the entropic toll” but rather “you can hide the toll, accrue entropic debt, and maybe get others to pay it.” The entropic debt incurred by the Bitcoin network is paid through the only channel where it touches the real world; that is, through the real world computational process of Proof-of-Work. (There are some details here worth exploring; we think we have a sense of how the accounting works but would probably learn from others, and in any case this post is already getting too long!)

Proof-of-Stake systems are a different matter. We’re aware of some studies of Ethereum timestamps but we would not presume to instruct the experts on the optimal approach. However, we do notice that even in academic research the problem of real world calibration tends to be neglected in favor of logical timestamping. Since microsecond advantages have been leveraged in real world transactions for over a decade, understanding calibration of logical clocks to real world clocks may be important. In any case, the learning we’ve done here so far has also made us wonder whether ideas for securing provenance of empirical observations could be relevant to some of the complexity/lifecycle discussions around blockchains generally. The question of how blockchains touch the real world is a subset of the larger question of how explicit, codified, and generally accepted instructions (i.e. protocols) are instantiated. How tight can the grip between instructions and actions be? How tight should it be?

In our domain real world calibration of time stamps is indispensable. We must have it. That’s because a key benefit of secure digital provenance of empirical observations is that you get to align multiple time-series on a single timeline. If you want to find out about two different phenomena measured by two different instruments (or even about one phenomenon measured in two different ways) the time series must line up. If you are dealing with phenomena like physical tremor you will be at 5-10 Hz. If you are dealing with ECG you might be over 100 Hz. I don’t know what the finest temporal measurements at the Large Hadron Collider are today, but they are certainly impossible to notarize using current tools. But the end of the scale represented by the first two should be within reach, and we should at least be able to calculate the cost of going further, maybe even all the way out. Where our interests align with SoP topics, perhaps it’s enough to say that awareness of these time calibration issues should help people be a little less naive about the ways in which digital protocols touch the real world.
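A small sketch of why this matters in practice, assuming numpy and entirely synthetic data: a roughly 10 Hz tremor series and a 128 Hz ECG series can be resampled onto a shared timeline only if their timestamps refer to the same physical clock, and a clock offset of just 100 ms is already about 13 ECG samples of misalignment.

```python
# Aligning two synthetic time series on a single calibrated timeline.
import numpy as np

rng = np.random.default_rng(0)

# Tremor sensor at 10 Hz, ECG at 128 Hz, both timestamped in epoch seconds.
t_tremor = np.arange(0, 10, 1 / 10)
tremor = np.sin(2 * np.pi * 6 * t_tremor) + 0.1 * rng.standard_normal(t_tremor.size)
t_ecg = np.arange(0, 10, 1 / 128)
ecg = np.sin(2 * np.pi * 1.2 * t_ecg)

# Resample both onto a shared 10 Hz timeline by interpolation.
t_common = np.arange(0, 10, 1 / 10)
tremor_common = np.interp(t_common, t_tremor, tremor)
ecg_common = np.interp(t_common, t_ecg, ecg)

# A 100 ms clock offset between the instruments silently compares different
# moments in the physical world.
ecg_skewed = np.interp(t_common, t_ecg + 0.1, ecg)
print("max alignment error from a 100 ms skew:",
      float(np.max(np.abs(ecg_common - ecg_skewed))))
```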

Below we formulate some additional timestamp questions that would be interesting to explore during our research this summer, if selected:

  1. How can the network reasonably agree about useful bounds on physical time? What is “good enough” to catalyze a protocol transition in empirical research? (A minimal sketch of what such bounds could look like follows this list.)
  2. What trade-offs are involved in decentralizing a timestamp server that makes assertions about real world time, not just logical time?
  3. How can we characterize and explain the threat model surrounding the coupling between physical time and logical time, as it may affect existing blockchain protocols and, more importantly, as it may condition our ideas for improving trust protocols in empirical research?
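On question 1, one shape the answer could take is sketched below (standard library only; the anchors, uncertainties, and identifiers are all hypothetical): instead of asserting an exact physical time, an observation’s commitment is sandwiched between two anchoring events whose physical times are known within stated uncertainties, which yields defensible bounds rather than false precision.

```python
# Bounding when an observation entered the digital realm by sandwiching it
# between two anchored events with known (but uncertain) physical times.
import hashlib

def commitment(observation: bytes, freshness_beacon: str) -> str:
    # Including a prior anchor proves the record was made *after* it; getting
    # the commitment included under the next anchor proves it was made *before* that.
    return hashlib.sha256(freshness_beacon.encode() + observation).hexdigest()

anchor_before = {"id": "anchor-1041", "physical_s": 1_700_000_000.0, "uncertainty_s": 1.5}
anchor_after  = {"id": "anchor-1042", "physical_s": 1_700_000_600.0, "uncertainty_s": 1.5}

obs = b"tremor=5.8Hz amplitude=0.3"
c = commitment(obs, anchor_before["id"])   # later included under anchor_after

lower = anchor_before["physical_s"] - anchor_before["uncertainty_s"]
upper = anchor_after["physical_s"] + anchor_after["uncertainty_s"]
print(f"observation {c[:12]} entered the digital realm between {lower} and {upper} (epoch seconds)")
```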

This has been an interesting and productive encounter for us, and we hope we get a chance to continue.
