
Linear Set performance with respect to number of items already in the Cache #85

Open
thesilentg opened this issue Jul 2, 2020 · 7 comments

thesilentg commented Jul 2, 2020

When switching out a map[string]string with freeCache in order to reduce garbage collection overhead, I noticed some strange behaviors:

  1. Even though the published benchmarks here state that freeCache is about 2x faster for Set than using a map, I was actually seeing roughly 2x slower Set performance in my benchmarks compared to the default map implementation.
  2. When I deployed my application, I actually saw ~4x slower Set performance when loading the data into the cache compared to the existing map implementation.

On (1), I realized that the benchmarks here don't preallocate the map with any capacity. In my local testing, preallocating the map to the expected size improved its benchmark performance by roughly 2x. Given that you must preallocate the freeCache to a desired size, the most accurate comparison would do something similar for the map. Otherwise, you're mostly measuring the map's cost of dynamically resizing as new data is added, not how fast it is at adding new data to an already-allocated map.
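A minimal illustration (not code from this issue; the capacity value is just a placeholder for the expected item count) of the only difference between the two setups:

package main

import "fmt"

func main() {
	const expectedItems = 1 << 20 // placeholder for the expected item count

	// Without a size hint the map starts small and rehashes repeatedly as it grows.
	grown := make(map[string][]byte)

	// With a size hint the buckets are allocated up front, which is the fairer
	// comparison against a freeCache that is sized at creation time.
	preallocated := make(map[string][]byte, expectedItems)

	fmt.Println(len(grown), len(preallocated))
}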

On (2), I did some more performance testing. Code below:

package bench // arbitrary package name; preamble added so the snippet compiles

import (
	"encoding/binary"
	"fmt"
	"testing"
	"time"

	"github.com/coocood/freecache"
)

func BenchmarkExponentialCache(b *testing.B) {
	numItems := 100000000
	cacheSize := 12 * 1024 * 1024 * 1024
	split := 100000

	cache := freecache.NewCache(cacheSize)
	var key [8]byte
	lastTime := time.Now()
	for i := 0; i < numItems; i++ {
		binary.LittleEndian.PutUint64(key[:], uint64(i))
		cache.Set(key[:], make([]byte, 8), 0)
		if i%split == 0 && i != 0 {
			fmt.Printf("Time to insert item %d through %d in milliseconds: %d\n", i-split, i, time.Since(lastTime).Milliseconds())
			lastTime = time.Now()
		}
	}
	fmt.Printf("Items: %d\n", cache.EntryCount())
}

func BenchmarkMap(b *testing.B) {
	numItems := 100000000
	split := 100000

	m := make(map[string][]byte, numItems)

	lastTime := time.Now()
	var key [8]byte
	for i := 0; i < numItems; i++ {
		binary.LittleEndian.PutUint64(key[:], uint64(i))
		m[string(key[:])] = make([]byte, 8)
		if i%split == 0 && i != 0 {
			fmt.Printf("Time to insert item %d through %d in milliseconds: %d\n", i-split, i, time.Since(lastTime).Milliseconds())
			lastTime = time.Now()
		}
	}

	fmt.Printf("Items: %d\n", len(m))
}

This creates a 12GB freeCache, attempts to write 100 million items into it, and measures the time to insert each batch of 100,000 items. It does the same for the built-in map (preallocated to the expected size). This produces the following graph:

[graph: insert time in milliseconds per 100,000-item batch vs. number of items already inserted, for freeCache and the built-in map]

It appears that freeCache has roughly linear latency for calls to Set with respect to the number of items already present in the cache. It certainly explains the benchmark behavior. By default, the benchmarking tests here will never write enough items into the cache to show this latency regression.

This biases the benchmark comparison between map and freeCache: it is only once ~2 million items have been written into the freeCache that the differences between the built-in map and freeCache become apparent.

In addition, you can see that writing the first ~1 million items into the map has very high latency. My guess is that this is almost entirely garbage collection overhead: until the map holds a lot of live data, the default GCPercent setting forces very frequent garbage collection cycles during the benchmark, because the ratio of garbage generated to live memory is quite high. As the map becomes more populated and the live heap grows, the frequency of garbage collection drops significantly.
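A rough way to check this guess (an illustrative sketch, not something measured in this issue; the item counts are placeholders) is to watch runtime.MemStats.NumGC while filling a map. The GC cycle count climbs fastest per batch while the live heap is still small and slows down once the map holds a lot of live data:

package main

import (
	"encoding/binary"
	"fmt"
	"runtime"
)

func main() {
	const numItems = 1 << 22 // placeholder; much smaller than the benchmark above
	const split = 1 << 19

	m := make(map[string][]byte, numItems)
	var key [8]byte
	var stats runtime.MemStats
	for i := 0; i < numItems; i++ {
		binary.LittleEndian.PutUint64(key[:], uint64(i))
		m[string(key[:])] = make([]byte, 8)
		if i%split == 0 && i != 0 {
			runtime.ReadMemStats(&stats)
			// Early batches account for many GC cycles; later batches, with a
			// larger live heap, trigger GC far less often per allocation.
			fmt.Printf("items=%d total_gc_cycles=%d\n", i, stats.NumGC)
		}
	}
}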

I have two open questions:

  1. Are you also able to validate this growing Set latency when more items are added?
  2. Is the linear latency for calls to Set expected? If so, can this be documented in the readme? If not, are there any potential workarounds to have constant (or at least sub-linear) Set latency with respect to the cache size?

coocood (Owner) commented Jul 2, 2020

The CPU profile shows that memmove is the dominant cost for freecache, while mallocgc is the dominant cost for the map.
As more data is populated into memory, the CPU cache miss rate increases, so the cost of memmove increases.
But mallocgc always allocates sequentially, so it doesn't have the cache-miss issue.
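For anyone who wants to reproduce this kind of profile, one option (an illustrative sketch, not code from this issue; the sizes are placeholders) is to wrap the insert loop with runtime/pprof, or equivalently run the benchmarks above with go test -cpuprofile and inspect the result with go tool pprof:

package main

import (
	"encoding/binary"
	"os"
	"runtime/pprof"

	"github.com/coocood/freecache"
)

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Profile only the insert loop; inspect afterwards with `go tool pprof -top cpu.out`.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	cache := freecache.NewCache(1 << 30)
	var key [8]byte
	for i := 0; i < 1<<24; i++ {
		binary.LittleEndian.PutUint64(key[:], uint64(i))
		cache.Set(key[:], make([]byte, 8), 0)
	}
}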

thesilentg (Author) commented Jul 2, 2020

Moving forward, is there a way to reduce the Set latency to enable fast inserts of large numbers of items? Our use case is storing tens to hundreds of millions of items in memory. These items all get loaded at once from an S3 file when the server starts, so taking multiple minutes to read in all of this data is less than ideal, especially since moving from ingesting x to 2x items takes more than 2x the time. Would a BulkSet method that takes multiple keys and values at once be able to reduce the latency here?

coocood (Owner) commented Jul 3, 2020

@thesilentg
You can launch multiple goroutines to Set the items concurrently; that may speed things up a lot.
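A minimal sketch of that approach (an illustration, not code from this thread; freecache is safe for concurrent use, and the worker count and sizes below are placeholders to be tuned):

package main

import (
	"encoding/binary"
	"fmt"
	"sync"

	"github.com/coocood/freecache"
)

func main() {
	const numItems = 1 << 24 // placeholder item count
	const workers = 8        // placeholder; tune to the number of cores

	cache := freecache.NewCache(1 << 30)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			var key [8]byte
			// Each goroutine writes its own interleaved slice of the key space.
			for i := w; i < numItems; i += workers {
				binary.LittleEndian.PutUint64(key[:], uint64(i))
				cache.Set(key[:], make([]byte, 8), 0)
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("items:", cache.EntryCount())
}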

thesilentg (Author) commented Jul 5, 2020

Thanks for the help. Moving to parallel inserts was able to shorten the overall duration. Although this is an improvement, it doesn't appear to change the asymptotic linear latency when calling Set. That is to say, latency for Set still increases as more items are written into the cache, even if we are able to write items faster using parallel processing. Are there any plans in the near future to address this?

coocood (Owner) commented Jul 6, 2020

It requires a large code modification.
So no plan yet.

thesilentg (Author) commented

That makes sense. Would it be OK if I updated the README documentation in a separate PR to explain the write limitations here and link back to this issue? That way new users would be aware that concurrent inserts are helpful when inserting large numbers of keys and that write latency will increase as more items are added.

coocood (Owner) commented Jul 8, 2020

@thesilentg
Sure!
