
Decentralizing the Sourcify Repository

Ethereum, Sourcify, Solidity, IPFS · 9 min read

(I'm updating this post from time to time. Please check the commit history for changes.)

This is a post to brainstorm about how to further decentralize Sourcify's repository of verified contracts. It was sparked by our discussions with perama.eth, based on their ideas laid out in a series of articles on a Time Ordered Distributable Database (TODD). TODD in turn builds on the amazing work done by TrueBlocks and their Unchained Index specification. Kudos to both of them. You don't need to read those resources to navigate this post, but I recommend them if you want to dig deeper.

Background

Sourcify is a decentralized and open-source Solidity source-code verification tool and an initiative for human-readable contract interactions. Now that's a mouthful. It has always been difficult to define Sourcify in a sentence because it is built from different pieces that come together.

The part I want to focus on here is decentralization. Unlike other block explorers with verification functionality, Sourcify's repository of verified contracts is published on IPFS. The root CID of the repository is periodically updated under the IPNS name /ipns/repo.sourcify.dev, so anyone can easily download and pin the whole repository. In case Sourcify suddenly stops working, older versions of the repo are accessible via other pinning services (thanks to web3.storage and estuary.tech) and volunteers (such as wmitsuda).

Now this is already an improvement over closed silos of verified contracts and having to scrape websites against their ToS. You know the deal. If Sourcify ceases to exist, the verified contracts will still be available, yes. But Sourcify remains the single owner of the IPNS keys and more or less the single source of truth. Only Sourcify can update the repo IPNS. Everyone else is watching and following Sourcify, and no one else is able to contribute. I said "older versions of the repo" above on purpose: if Sourcify stops, the repo stalls.

This is where inspiration from TODD might be useful. TODD itself is inspired by the Unchained Index.

The Unchained Index

The Unchained Index (by TrueBlocks) is a time-ordered index of address appearances. An address appearance is any occurrence of an address in a transaction in a block, including in deeper (internal) function calls.

Say the owner of the address 0xea123 calls the contract 0xcc123 at block 1,000,000, tx number 4, and this tx pays out some ether to your address 0xea456. In the next block 1,000,001, tx number 10, you send the ether from 0xea456 to your cold wallet address 0xea789. The address appearance index will then include:

Address    Block        Tx
0xea123    1,000,000    4
0xcc123    1,000,000    4
0xea456    1,000,000    4
0xea456    1,000,001    10
0xea789    1,000,001    10

...amongst other address appearances in the "chunk". A chunk is a periodically published new piece of the index; in the case of the Unchained Index, a new chunk is published for every 2,000,000 new address appearances. Additionally, each chunk has a "bloom filter" associated with it: a probabilistic data structure that lets us ask "is my address 0xea456 included in this chunk?" and get an answer (with a small chance of a false positive, but never a false negative).
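To make the bloom filter idea concrete, here is a minimal sketch in TypeScript. This is not TrueBlocks' actual bloom format, just the general technique:

// A minimal bloom filter: k hash functions set k bits per item.
// Queries can return false positives but never false negatives.
class Bloom {
  private bits: Uint8Array;
  constructor(private size = 1 << 16, private k = 4) {
    this.bits = new Uint8Array(size);
  }
  // Simple FNV-1a style hash, seeded per hash function. Illustrative only.
  private hash(item: string, seed: number): number {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.size;
  }
  add(item: string): void {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(item, s)] = 1;
  }
  mightContain(item: string): boolean {
    for (let s = 0; s < this.k; s++) {
      if (!this.bits[this.hash(item, s)]) return false; // definitely absent
    }
    return true; // possibly present
  }
}

const bloom = new Bloom();
bloom.add("0xea456");
console.log(bloom.mightContain("0xea456")); // true
console.log(bloom.mightContain("0xea999")); // almost certainly false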

All this information is found in the "manifest" file. The file looks like this:

{
  "version": "trueblocks-core@v0.40.0",
  "chain": "mainnet",
  "schemas": "Qmart6XP9XjL43p72PGR93QKytbK8jWWcMguhFgxATTya2",
  "databases": "Qmart6XP9XjL43p72PGR93QKytbK8jWWcMguhFgxATTya2",
  "chunks": [
    {
      "range": "015013585-015016368",
      "bloomHash": "QmREw5qaoucbVvEQzF71D44rXKzax9YgKuEEhZYHAYFZF5",
      "indexHash": "QmTbFshRSdBFoC6AvBgzdRJ6Vgb9cVL3yTprYQ24XqHTqx"
    },
    {
      // and so on...
    }
  ]
}

The relevant part is the chunks array. The range covers the blocks from the chunk's first address appearance to its 2,000,000th. The bloomHash is the CID of the chunk's bloom filter, and the indexHash is the CID of the actual index data you need to download if your address appears in this chunk.
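In TypeScript terms, the manifest shape is simple (field names taken directly from the JSON above):

interface Chunk {
  range: string;      // "firstBlock-lastBlock" covered by this chunk
  bloomHash: string;  // IPFS CID of the chunk's bloom filter
  indexHash: string;  // IPFS CID of the chunk's index data
}

interface Manifest {
  version: string;
  chain: string;
  schemas: string;
  databases: string;
  chunks: Chunk[];
}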

The TrueBlocks client continuously follows the blockchain, re-runs each transaction itself, and records every address appearance. When the number of indexed address appearances hits 2,000,000, it appends a new chunk to this chunks array and publishes the new manifest (i.e. the new manifest IPFS CID).

The "head" of the index, i.e. the latest manifest CID is published inside a smart contract. The contract is permissionless, that is anyone can publish their own manifest, anyone can be a publisher. You'd choose a publisher you trust and see what they published. If you trust TrueBlocks, they announce their address 0xf503017d7baf7fbc0fff7492b751025c6a78179b and you look for the latest manifest CID by:

string latestManifest = manifestHashMap["0xf503017d7baf7fbc0fff7492b751025c6a78179b"]["mainnet"];
// latestManifest = "QmRGCuUaTH9yJTuGmgUv7N31qunLC4Vvqzvxyq1C1tMGF7"

Try it yourself
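For illustration, here is a minimal sketch of the same read with ethers.js. The ABI fragment mirrors the pseudocode above; the registry address and RPC URL are placeholders:

import { ethers } from "ethers";

// Assumed ABI fragment for the public mapping shown above.
const REGISTRY_ABI = [
  "function manifestHashMap(address publisher, string chain) view returns (string)",
];

async function latestManifestOf(publisher: string): Promise<string> {
  const provider = new ethers.JsonRpcProvider("https://eth.example-rpc.com"); // placeholder RPC
  const registry = new ethers.Contract(
    "0xREGISTRY_ADDRESS_HERE", // placeholder: the registry contract address
    REGISTRY_ABI,
    provider
  );
  return registry.manifestHashMap(publisher, "mainnet");
}

latestManifestOf("0xf503017d7baf7fbc0fff7492b751025c6a78179b").then(console.log);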

Now if I have an address 0xea456 and want to see in which txs my address appears, I'd:

  1. Ask the smart contract for the latest manifest CID (as published by TrueBlocks)
  2. Download the manifest.
  3. Download all bloom filters (significantly smaller than the whole index)
  4. Ask each chunk's bloom filter if my address appears in that chunk
  5. Download the relevant chunks

These actions can all be done with the Unchained Index's client chifra, and they rely on the existing index generated by someone else, in this case by TrueBlocks. But if you'd like, with chifra and a local Ethereum node you can generate the whole Unchained Index yourself instead of downloading it, and check whether the published index actually matches yours.
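For the curious, here's a rough TypeScript sketch of steps 1-5 over a public IPFS gateway. Decoding TrueBlocks' bloom format is out of scope here, so bloomContains is a hypothetical helper:

// Sketch: fetch the manifest, filter chunks via their bloom filters,
// then download only the chunks that might contain the address.
const GATEWAY = "https://ipfs.io/ipfs/";

interface Chunk { range: string; bloomHash: string; indexHash: string; }
interface Manifest { chunks: Chunk[]; }

// Hypothetical helper: decoding the TrueBlocks bloom format is omitted.
declare function bloomContains(bloom: ArrayBuffer, address: string): boolean;

async function relevantChunks(manifestCid: string, address: string): Promise<Chunk[]> {
  const manifest: Manifest = await (await fetch(GATEWAY + manifestCid)).json();
  const hits: Chunk[] = [];
  for (const chunk of manifest.chunks) {
    const bloom = await (await fetch(GATEWAY + chunk.bloomHash)).arrayBuffer();
    if (bloomContains(bloom, address)) hits.push(chunk); // maybe-present chunks only
  }
  return hits; // next step: download hits[i].indexHash
}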

Another neat feature of this structure is the decentralization of the data itself. When you download the relevant chunks (by their IPFS CIDs), the client (e.g. chifra) can pin the data on IPFS automatically so that users also serve the data. They become "seeders", in BitTorrent terms. In fact this is the main reason torrents work: people who wanted to consume become servers, sometimes unknowingly.
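For example, a client could pin whatever it just downloaded through a local IPFS (Kubo) daemon; a sketch with kubo-rpc-client:

import { create } from "kubo-rpc-client";
import { CID } from "multiformats/cid";

// Connects to a local Kubo daemon on its default API port.
const ipfs = create({ url: "http://127.0.0.1:5001" });

// After downloading a chunk, pin it so this node starts serving it too.
async function pinChunk(cid: string): Promise<void> {
  await ipfs.pin.add(CID.parse(cid));
  console.log(`pinned ${cid}`);
}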

Generalizing the Unchained Index

If you zoom out, you'll see that this kind of publishing of time-ordered chunks is a generalizable concept. Another instance of such a database is the "address-appearance-index", derived from the Unchained Index by perama.eth. Here, the periodically published "chunks" are renamed "volumes", and the volumes contain functional "chapters", borrowing the nomenclature from book publishing. A chapter is the grouping you fetch that will contain the data you are looking for. If I'm looking for an address appearance of 0xbc1dbf..., I'd get the chapters 0xbc. The next chapter (which I don't need) would be 0xbd.

A volume is published every 100,000 blocks, and a chapter contains the addresses that share their first two hex characters, e.g. 0xab.... Because two hex characters give 16 x 16 = 256 combinations, a chapter will be roughly 1/256th of the whole data. If you want the address appearances of 0xbc1dbf..., you have to go through every volume, download the chapter 0xbc of each, and check for your address's appearances. The goal is to nudge people into downloading (and pinning) more data than they need. I don't need any of the other address appearances in the chapter 0xbc besides 0xbc1dbf..., but I have to fetch them. In turn, I start contributing to the network by serving the data.
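Deriving the chapter for an address is then trivial; a quick sketch:

// The chapter is simply the first two hex characters after "0x".
function chapterOf(address: string): string {
  return "0x" + address.slice(2, 4).toLowerCase();
}

console.log(chapterOf("0xbc1dbf2e...")); // "0xbc" (one of 16 x 16 = 256 chapters)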

Quoting from perama.eth:

If volumes are time based and chapters are targeted to user desires:

  • volume x
    • Useful chapter <- User downloads this
    • Other chapters
  • volume x + 1
    • Useful chapter <- User downloads this
    • Other chapters

A user wanting to get the address appearances of 0xbc1dbf will need:

  • volume 1, chapter 0xbc
  • volume 2, chapter 0xbc
  • ...
  • volume 36, chapter 0xbc

In the case of the Unchained Index, one knows which "chunks" (here "volumes") to download by asking the bloom filters: "is my address included in this chunk?". In the address-appearance-index, the chapters are explicit and by default you download the relevant chapter from every volume. If there are 36 volumes, you download 36 x chapter 0xbc.
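A sketch of that default behaviour, assuming volumes and chapters are addressable under a gateway path like .../volume_{n}/chapter_0xbc (the path scheme and IPNS name here are hypothetical):

// Download chapter 0xbc from all 36 volumes (hypothetical path scheme).
const BASE = "https://ipfs.io/ipns/example-index.eth/"; // placeholder IPNS name

async function downloadChapters(chapter: string, volumeCount: number): Promise<ArrayBuffer[]> {
  const chapters: ArrayBuffer[] = [];
  for (let n = 1; n <= volumeCount; n++) {
    const res = await fetch(`${BASE}volume_${n}/chapter_${chapter}`);
    chapters.push(await res.arrayBuffer());
  }
  return chapters;
}

downloadChapters("0xbc", 36);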

Focusing on Sourcify

Let's focus back on Sourcify. The problems were:

  • Problem 1: There is no single open database of verified contracts.
  • Problem 2: Yes, the Sourcify repo is on IPFS, but it can only be updated by Sourcify. People can't contribute by default.

The Sourcify repo is also an "ever-growing" repo of immutable files: a verified contract never changes, unlike rows in a traditional database. So we can have a similar manifest and periodically announce the newly verified contracts. Whenever, say, 1000 new contracts are verified on a chain, publish a new chunk and update the manifest CID on the smart contract.
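As a sketch, a publisher loop could look like this (addToIpfs, appendChunkToManifest and announceOnChain are hypothetical helpers standing in for an IPFS add, a manifest update, and the smart-contract announcement):

// Hypothetical publisher loop: batch newly verified contracts into chunks.
const CHUNK_SIZE = 1000;

declare function addToIpfs(paths: string[]): Promise<string>;               // returns chunk CID
declare function appendChunkToManifest(chunkCid: string): Promise<string>;  // returns new manifest CID
declare function announceOnChain(manifestCid: string): Promise<void>;       // registry update

const pending: string[] = []; // paths of newly verified contract folders

async function onContractVerified(contractPath: string): Promise<void> {
  pending.push(contractPath);
  if (pending.length >= CHUNK_SIZE) {
    const chunkCid = await addToIpfs(pending.splice(0, CHUNK_SIZE));
    const manifestCid = await appendChunkToManifest(chunkCid);
    await announceOnChain(manifestCid);
  }
}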

Similar to the Unchained Index, there can be multiple publishers. Additionally, publishers can listen to other publishers to keep their repositories in sync, because each verifier has different sources of verified contracts and there's no single global source to follow and sync from, unlike a blockchain. Alice may verify her contract on Sourcify but Bob might choose to verify on Blockscout. This unfortunately splits the global database of verified contracts, and you need to check each verifier if you want to access a contract's source code.

Publishers listening to each other and syncing makes sense because contract verification is deterministic. If, say, Blockscout publishes a new chunk with a new manifest CID, Sourcify can download the new chunk, run through each contract and verify them, and theoretically should reach the same manifest CID (or root CID). Next time Sourcify publishes a new set of verified contracts, Blockscout verifies all the contracts, gets (hopefully) the same CID, and publishes it on the smart contract:

manifestHashes["mainnet"]["sourcify.eth"] = "QmUv7N31qunLC4Vvqzvxyq1C1tMGF7..."
manifestHashes["mainnet"]["blockscout.eth"] = "QmUv7N31qunLC4Vvqzvxyq1C1tMGF7..."
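From Sourcify's side, the sync step could look roughly like this (all helpers are hypothetical):

// Hypothetical sync loop: re-verify another publisher's chunk, then
// compare the resulting root CIDs. Matching CIDs mean identical repos.
declare function fetchChunk(chunkCid: string): Promise<string[]>;   // contract folders
declare function verifyContract(folder: string): Promise<boolean>;  // deterministic check
declare function addToRepo(folder: string): Promise<void>;
declare function repoRootCid(): Promise<string>;

async function syncFrom(chunkCid: string, theirRootCid: string): Promise<void> {
  for (const folder of await fetchChunk(chunkCid)) {
    if (await verifyContract(folder)) await addToRepo(folder);
  }
  const ours = await repoRootCid();
  if (ours !== theirRootCid) {
    console.warn("fork: repositories diverged, needs investigation");
  }
}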

This itself behaves like a blockchain: someone announces a new block, others listen, run through it, and update their chain head. In fact, ligi suggested a similar PoA chain mechanism back then.

This would solve the problems above:

  • Problem 1: There is no single open database of verified contracts.
    • Solved: we have a single open database of verified contracts.
  • Problem 2: Yes, the Sourcify repo is on IPFS, but it can only be updated by Sourcify. People can't contribute by default.
    • Solved: people can contribute by default by publishing new contracts.

But it brings two other problems:

  1. How do we structure the data?
  2. What happens when there's a fork (i.e. manifest hashes don't match)?

Structuring the data

How is the Sourcify repository currently structured?

One can access a contract's source files at:

https://repo.sourcify.dev/contracts/{matchType}/{chainId}/{address}/
e.g. https://repo.sourcify.dev/contracts/full_match/11155111/0x2738d13E81e30bC615766A0410e7cF199FD59A83/

or on IPNS:

/ipns/repo.sourcify.dev/contracts/{matchType}/{chainId}/{address}/
e.g. https://ipfs.io/ipns/repo.sourcify.dev/contracts/full_match/11155111/0x2738d13E81e30bC615766A0410e7cF199FD59A83/

Where matchType is either full_match or partial_match. See more.

The IPNS name resolves to the latest root CID of the repository. Currently, the IPNS record is updated every 6 hours.
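For instance, a client can fetch a verified contract's metadata straight off a public gateway, assuming the usual metadata.json inside the contract folder:

// Fetch a verified contract's metadata through a public IPFS gateway.
const url =
  "https://ipfs.io/ipns/repo.sourcify.dev/contracts/full_match/11155111/" +
  "0x2738d13E81e30bC615766A0410e7cF199FD59A83/metadata.json";

fetch(url)
  .then((res) => res.json())
  .then((metadata) => console.log(metadata.language, metadata.compiler));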

How can we structure the repo in the TODD way?

Perama's suggestion

perama suggests the following structure:

Publish a new volume every 1000 new contracts (on a single chain). Each volume is split into 256 chapters by the first two characters of an address: 0x00, 0x01, ..., 0xff. Bear in mind that these parameters are just initial suggestions and need to be chosen carefully for the system to work.

  • volume 1
    • chapter 0x00
    • chapter 0x01
    • ...
    • chapter 0xff
  • ...
  • volume N
    • chapter 0x00
    • chapter 0x01
    • ...
    • chapter 0xff

This means that if a user wants to get the contract 0xde0b295...b697bae, they will have to download the chapter 0xde from all volumes. The goal is, again, to nudge people into downloading more data than they need and pinning it.

Unlike address appearances, I don't think this fits the Sourcify use case:

  • The reasoning for chapters is to make the user download slightly more data than they need and pin it, but is this needed, and even possible?
    • I don't think the decentralization of the data is going to be a problem if there are multiple publishers, pinning services, and volunteers.
    • There is no client like chifra to "consume" the verified contracts. The files being downloaded are just text files. The user just wants to see a single contract, likely on a web interface. How can we make them run an IPFS node and pin the files?
  • A contract should be directly accessible with its chain+address identifier. Consider a block explorer frontend wanting to pull the verified contract over IPFS (Otterscan does this if you choose "Resolve IPNS" from the top-right). The frontend should not be required to download all the chapters of all volumes to serve the source files.
  • I can see the number of volumes getting unreasonably high at some point, since the user needs to download the relevant chapter from every volume.

Without chapters

Remember that the address appearances of an address keep growing. If you transact with your address, a new appearance will be included in the next volume, so you need to download the new volumes too. With contract source code this is not the case: a contract address is verified once and that's it. So IMHO, merging already identifiable, immutable data into a larger chunk (chapter) is not ideal.

That brings me to thinking (again with the blockchain similarities): can we get away with just publishing volumes (blocks) that contain each newly verified contract (transactions) and the new root hash of the whole repo (the chain head's block hash)?

Can we just publish:

  • volumes that contain the diff, i.e. the newly verified contracts
  • and the final root hash of the repo, i.e. the result after processing the diff.

Now if there's a root hash Qmf9Bv...c1Mq4 for a chain, manifested by multiple publishers you trust, you can do /ipfs/Qmf9Bv...c1Mq4/contracts/full_match/0x2738d13E81e30bC615766A0410e7cF199FD59A83/ and get the contract.

Note that there is no chainId here because different publishers can choose to verify different chains.

How do publishers sync?

Sourcify receives 1000 new verified contracts, then publishes the new contracts (the volume) and the root IPFS hash. Blockscout takes the volume, verifies the contracts, places them in its repo, and publishes its root IPFS hash, and hopefully we'll have the same hash.

Forks?

Now if everyone used the same "verification client", no forks should happen. But already the way Sourcify verifies contracts is different from the way Blockscout does, so not all contracts verified on Blockscout can be verified by Sourcify too, particularly if contract verification is done against an internal database. Ideally, contract verification should be reproducible, and the verified contract folder should contain everything needed to reproduce the verification. This is what we like to call "edge verification": as a user, you should be able to easily download the whole contract folder and do the verification locally, without trusting a third party.
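A rough sketch of what such a local re-verification could look like, assuming the folder contains the standard Solidity metadata.json and the sources (compileWithSolc is a hypothetical wrapper around solc-js; partial-match subtleties around the metadata hash are ignored):

import { ethers } from "ethers";
import { readFileSync } from "fs";

// Hypothetical wrapper around solc-js: compiles from the metadata's settings
// and sources, returning the expected runtime bytecode.
declare function compileWithSolc(metadata: any, sourcesDir: string): Promise<string>;

async function edgeVerify(folder: string, address: string): Promise<boolean> {
  const metadata = JSON.parse(readFileSync(`${folder}/metadata.json`, "utf8"));
  const compiled = await compileWithSolc(metadata, `${folder}/sources`);

  // Compare against the bytecode actually deployed on-chain.
  const provider = new ethers.JsonRpcProvider("https://eth.example-rpc.com"); // placeholder
  const onchain = await provider.getCode(address);
  return onchain === compiled;
}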

But I digress; that's another topic. What I want to say is that there needs to be a common way to verify contracts. For now, I can only think of social coordination: checking for diffs between repos and investigating why repositories diverged.

I keep thinking about this and would appreciate any ideas or feedback you have.