Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symlink Metadata and Double Packing Avoidance #256

Closed
wants to merge 5 commits into from

Conversation

organizedgrime
Copy link
Contributor

Enclosed in this pull request are two relatively trivial changes to PrivateDirectory and PrivateFile which allow us to get more out of our rs-wnfs use case.

  1. The attach and cp functions within PrivateDirectory now have attach_link and cp_link versions. The issue we encountered while implementing this tool in Tomb is that despite our being able to correctly identify when files are duplicates of one another, copying them using the cp function still creates a new PrivateNode in the PrivateForest which, when encrypted and sent to the BlockStore employed, does not result in CID collisions and results in the double packing of the same file on disk. This is because the cp function described updates the ancestry and rotates the key associated with the PrivateNode being copied. This is not something we want to do, we want these CID collisions to occur in our use case so at to minimize our disk footprint. Thus, attach_link and cp_link were made as lightweight versions of the original function which maintain ancestry and keys across this linking/copying.
  2. The write_symlink function in PrivateDirectory and symlink_origin function in PrivateFile are also new. They create an interface for writing a PrivateFile to a PrivateDirectory where the contents of the file itself are empty, but encoded in the Metadata is information, a single String value associated with the key "symlink". write_symlink provides an interface for creating such PrivateFiles, symlink_origin provides an interface for retrieving an Option<String> for a given PrivateFile, which will either be the symlink path it points to, or nothing at all if it does not contain said Metadata.

Strictly speaking, if any of these changes are unwelcome we can implement them ourselves using a trait that is applied to the PrivateDirectory structure.

@organizedgrime organizedgrime requested a review from a team as a code owner May 5, 2023 17:29
@codecov
Copy link

codecov bot commented May 5, 2023

Codecov Report

Merging #256 (28c890d) into main (b3afbbc) will increase coverage by 0.27%.
The diff coverage is 66.66%.

❗ Current head 28c890d differs from pull request most recent head c9959bd. Consider uploading reports for the commit c9959bd to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
+ Coverage   60.13%   60.40%   +0.27%     
==========================================
  Files          36       36              
  Lines        2744     2743       -1     
  Branches      677      677              
==========================================
+ Hits         1650     1657       +7     
+ Misses        631      622       -9     
- Partials      463      464       +1     
Impacted Files Coverage Δ
wnfs/src/private/directory.rs 67.41% <ø> (ø)
wnfs/src/private/file.rs 80.27% <60.00%> (+0.34%) ⬆️
wnfs-common/src/metadata.rs 54.00% <100.00%> (+9.35%) ⬆️

Comment on lines 236 to 240
impl From<&NodeType> for String {
fn from(r#type: &NodeType) -> Self {
match r#type {
NodeType::PrivateDirectory => "wnfs/priv/dir".into(),
NodeType::PrivateFile => "wnfs/priv/file".into(),
NodeType::PublicDirectory => "wnfs/pub/dir".into(),
NodeType::PublicFile => "wnfs/pub/file".into(),
NodeType::TemporalSharePointer => "wnfs/share/temporal".into(),
NodeType::SnapshotSharePointer => "wnfs/share/snapshot".into(),
}
r#type.to_string()
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch

Comment on lines +383 to +410
/// Create a new Symlink PrivateFile
pub async fn new_symlink(
path: String,
parent_bare_name: Namefilter,
time: DateTime<Utc>,
rng: &mut impl RngCore,
) -> Result<Self> {
// Header stays the same
let header = PrivateNodeHeader::new(parent_bare_name, rng);
// Symlinks have no file content
let content = FileContent::Inline { data: vec![] };
// Create a new Metadata object
let mut metadata: Metadata = Metadata::new(time);
// Write the original path into the Metadata HashMap
metadata
.0
.insert(String::from("symlink"), Ipld::String(path));
// Return self with PrivateFileContent
Ok(Self {
header,
content: PrivateFileContent {
persisted_as: OnceCell::new(),
metadata,
previous: BTreeSet::new(),
content,
},
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. I know you must have thought of making it a new node type. Is there a reason you chose this approach instead?

@appcypher
Copy link
Member

Thanks for taking on this. We were going to work on this at some point but the spec is not there yet. For private file system in particular it would be nice to have a symlink that would not expose the secret information (the entire header in this case) of the target node.

@matheus23 has more ideas about how this should work. He is on vacation now but he is surely going to comment here when he is back. I see this PR reviving the discussion on the topic and could be a base for further work.

Comment on lines +1216 to +1239
async fn attach_link(
self: &mut Rc<Self>,
node: PrivateNode,
path_segments: &[String],
search_latest: bool,
forest: &mut Rc<PrivateForest>,
store: &impl BlockStore,
) -> Result<()> {
let (path, node_name) = crate::utils::split_last(path_segments)?;
let SearchResult::Found(dir) = self.get_leaf_dir_mut(path, search_latest, forest, store).await? else {
bail!(FsError::NotFound);
};

ensure!(
!dir.content.entries.contains_key(node_name),
FsError::FileAlreadyExists
);

dir.content
.entries
.insert(node_name.clone(), PrivateLink::from(node));

Ok(())
}
Copy link
Member

@appcypher appcypher May 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting use-case. So the in-memory tree structure might be good enough for fast copies but not updating the ancestry will leak secrets about unrelated directories. I'm trying to think of a way to handle this use-case. Will need to meet with @matheus23 for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...

I want to point out that this does something to the way that permissions work in WNFS:
In WNFS, each PrivateNode that's stored has a bunch of inumbers in its accumulator (the Namefilter in this version).
One can think of each inumber as a "path segment" in a path.
You can prove (to a third party) write access to a certain PrivateNode by providing a signature of an inumber, signed by the "root owner" of a WNFS.

If you simply copy over a PrivateRef from one place to another, you won't be updating the whole Namefilter structure (this is what the attach logic is doing!).
Concretely this means:
If you have a file at /Doc/Test.png and you symlink it to /Images/Test.png, then someone who has write access to /Images won't be able to modify Test.png! (Read access will work just fine)

This breaks an invariant in WNFS: That if you have write access to a directory, you'll have write access to its contents (recursively).

I guess in theory you could fix this by including both /Doc's and /Images' inumbers in the Test.png PrivateNode.

Can we go back to the drawing board on this? That'd probably be the "right" way of solving deduplication in rs-wnfs (at least on the file level). LMK if that makes sense to you @organizedgrime and let's brainstorm a bit about how this could look like API-wise (probably helpful if we do it live).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in theory you could fix this by including both /Doc's and /Images' inumbers in the Test.png PrivateNode.

I thought about this again and now I think this doesn't work (with name accumulators).

@matheus23
Copy link
Member

matheus23 commented May 15, 2023

About the symlink PrivateFile:
You should be able to implement both write_symlink, symlink_origin and PrivateFile::new_symlink outside of rs-wnfs.
Specifically, write_symlink is now enabled via open_file_mut:

let file = root.open_mut(...).await?;
file.get_metadata_mut().put("symlink", Ipld::(...));

(uh fuck we're still missing a get_metadata_mut function)


I hope you understand that I'm somewhat hesitant to merge a write_symlink function into rs-wnfs at this point in time, since in the future we're planning on figuring out exactly what the needs for symlinks are & implementing a dedicated spec for it.

For now: Let's meet live & talk about this again!

@matheus23
Copy link
Member

matheus23 commented May 19, 2023

Let me throw out another proposal: With cp_link and attach_link you're currently attempting to build deduplication on the Dir + File node level.

Would you be fine if the deduplication only happened on the file content level? I.e. you'd get efficient deduplication if there's two huge files in the tree you're persisting, but you wouldn't get deduplication when there's two directories with exactly equivalent content?

If that's the case, I'd propose a function PrivateFile::cp with (among others) a parameter reencrypt: bool. If this were set to false, it would simply take the FileContent::External & copy it, as well as walking all generate_shard_lables for the source file and copy its value in the PrivateForest over to the new keys, simply re-using the Cids of each block, instead of reading the file & re-encrypting. Another parameter would simply be a parent_name_filter (or when name accumulators land parent_name), similar to how PrivateFile::prepare_key_rotation works.

(If reencrypt is set to true, it does the same thing that PrivateDirectory::cp does with files.)

This trades off some metadata leakage (you can see that some blocks in the private forest are linked to from multiple places, perhaps giving you an idea of a file's size, e.g. if you expect this to only happen to a single file) for performance / storage.

@matheus23
Copy link
Member

This would be a more short-term thing that I'd be more happy in merging until we have a symlink specification that covers a bunch of use-cases.

@matheus23
Copy link
Member

Ok, coming out of the meeting it sounds like this

Would you be fine if the deduplication only happened on the file content level?

can be assumed for banyan's stuff atm.

Which makes me want to basically say: How about we do the PrivateFile::cp version. LMK if my explanation above isn't quite enough for figuring out how to implement it. Then I can help with that.

@matheus23
Copy link
Member

@organizedgrime thoughts on the idea in this comment? #256 (comment)

@matheus23
Copy link
Member

Having thought about this again, I wouldn't recommend my suggestion anymore.
Linking some ciphertext twice or more inside the private forest leaks information.
In an example:
Someone has a big file system with lots of files. The forest is full of blocks that can't be related to each other.
One file is copied over without reencryption. This causes the forest to have multiple entries with the same CIDs as values.
An attacker can go through the forest and look at which CIDs appear twice. This leaks the blocks relating to the copied file, as well as the copied file's exact file size.


We could try solving this in a layer slightly below the forest, but that'd need a spec extension. Actually it'd be quite similar to the approach used in this PR, so perhaps we revive it. In that case I'd prefer to keep it as a special case for files only, since there's no hierarchy for delegate write access from these files.
Needs a little more thought though.

@matheus23
Copy link
Member

We could try solving this in a layer slightly below the forest, but that'd need a spec extension.

This happened: wnfs-wg/spec#71
It's not an extension, it'll simply be specified once it lands.

Implementation is happening in #306, see this bullet point in the description:

  • Added base_name to ExternalFileContent. This allows users with snapshot access only to derive the necessary block labels (they wouldn't be able to access the header.name).

Once that PR lands, we can implement (very efficient) copying of PrivateFiles without re-encrypting.

@matheus23
Copy link
Member

Closing in favor of #319

@matheus23 matheus23 closed this Aug 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants