
core: cache some embedded DT info + generic bisect helper function #7067

Closed
wants to merge 7 commits

Conversation

etienne-lms
Contributor

Parse the embedded DTB prior to using it and cache information that libfdt functions can take time to find. The cached info is stored in sorted arrays so that we can bisect to find the target info. The cached information is:

The feature saves from a few dozen to a few hundred milliseconds of boot time depending on the DTB size (more specifically, on the number of nodes in the DTB): for example, about 400ms of boot time saved on STM32MP15/pager, and about 600ms saved on our STM32MP25 downstream config.

The patch series adds a generic bisect helper function and a test case to the selftests in the test PTA. The test is run by CI on the qemuv7 and qemuv8 platforms. Both platforms also embed a very small DTB in OP-TEE core, so the cached DT info is also lightly tested (their embedded DTB only contains 7 nodes and 1 phandle).

Implement a generic array bisecting helper function, bisect_equal().

Change-Id: Ic0915b994fbed45fc3f3d7efd2b50991676166a1
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
Add a bisect test sequence to OP-TEE selftest support in the test PTA.

The test checks bisect operation for present and absent cells, on
arrays from 0 to 65 cells.

Change-Id: I281da151eb9f6d5191e9f325e26a84f7e0f9593f
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
Parse the embedded DTB before drivers get information from it, and save
information on parent nodes (parent node offset, address and size
cells values) in order to speed up the fdt_parent_offset() libfdt
function, which mainly parses the DTB from its root node to fetch
information.
Also change fdt_reg_base_address(), fdt_reg_size() and
fdt_fill_device_info() to leverage this support.

The cached information is sorted by node offset value to allow an
efficient bisect sequence to retrieve the desired information.

The cached resources are released once OP-TEE core initialization
is completed.

This feature is enabled with the configuration switch CFG_DT_CACHED_NODE_INFO.

Change-Id: I32c084afb384f9aafad16a6f642d39d07ed55069
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
Parse the embedded DTB before drivers get information from it, and
save information on node phandles, if any, in order to speed up the
fdt_node_offset_by_phandle() libfdt function, which parses the DTB
from its root node to find a node offset based on the node phandle.

The cached information is sorted by phandle value to allow an
efficient bisect sequence to retrieve the desired information.

The cached resources are released once OP-TEE core initialization
is completed.

This feature is added to the CFG_DT_CACHED_NODE_INFO support.

Change-Id: Ibf9f463c3c3e01e0225555ca5386f6c3104e646e
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
Explicitly default enable CFG_DT_CACHED_NODE_INFO on plat-stm32mp1.

Change-Id: I9b63c3dc5073b5d89d7509a86670a6bb130aed76
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
Explicitly default enable CFG_DT_CACHED_NODE_INFO on plat-stm32mp2.

Change-Id: I4cd1bf1da1a243f9e2d67bba635f9b52d6826b1a
Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
@etienne-lms
Contributor Author

CI / Code style test failure is due to false positive warnings on trace messages exceeding 80 chars/line:

8deb8ab9d core: pta: test bisect implementation
WARNING: line length of 105 exceeds 80 columns
#98: FILE: core/pta/tests/misc.c:620:
+						LOG("-> Found unexpected target %d (size %zu, hide %zu)",

WARNING: line length of 102 exceeds 80 columns
#105: FILE: core/pta/tests/misc.c:627:
+						LOG("- Failed to find target %d (size %zu, hide %zu)",

WARNING: line length of 108 exceeds 80 columns
#110: FILE: core/pta/tests/misc.c:632:
+						LOG("- Wrongly found %d for target %d (size %zu, hide %zu)",

total: 0 errors, 3 warnings, 0 checks, 111 lines checked


@jforissier jforissier left a comment


Parse the embedded DTB prior using it and cache information that can take time for libdft functions to find. Cached info is stored in sorted arrays so that we can bisect to find the target info.

Have you considered using a hash table? I have a feeling it would be even faster.

More comments below.

* @parent_offset: Output parent node offset upon success
* @return 0 on success and -1 on failure
*
* This function is supported when CFG_DT_CACHED_NODE_INFO is enabled.
Contributor


I believe the cache feature should be more transparent. Ideally hook the cache into the FDT lib code. Or if we don't want to touch libfdt, make a wrapper. But the caller should not need to know if the cache is enabled or not.

Contributor Author


I'm fine with patching libfdt but I'm pretty confident such a change would never reach libfdt mainline.
The way fdt_reg_base_address() is implemented in our kernel/dt.h to fully leverage the cached address-cells value (to avoid a node property parsing) means that this very function needs to be patched.
Same for fdt_reg_size() (to avoid parsing node properties twice) and fdt_fill_device_info() (2 parent node lookups + 2 node property parsings).

That said, I can change this inline description to:

- * Find the offset of a parent node in the parent node cache
+ * Find the offset of a parent node in the parent node cache if any
 * (...)
- *
- * This function is supported when CFG_DT_CACHED_NODE_INFO is enabled.

Contributor Author


I've removed the fixup commit that held your comment. The commit was noisy and buggy so I replaced it with another fixup commit that should work better.
Please feel free to use this comment thread if needed. I won't tag it 'Resolved' :)

* @cmp: Trilean comparison helper function applicable to @array
* @return: Pointer to the cell in @array matching @target, NULL if none found
*/
void *bisect_equal(const void *array, size_t n, size_t cell_size,
Contributor


Strange name, how about bisect_find()?

Contributor Author


I agree the name may sound strange.
I chose bisect_equal() since the function finds a cell that matches the target exactly, as reported by the cmp() callback.
I think there could be bisect_roundup(), bisect_rounddown() and bisect_nearest() for returning the closest matching cell.
That said, maybe plain bisect() could be applicable for the exact matching case.

Contributor Author

@etienne-lms etienne-lms left a comment


Parse the embedded DTB prior using it and cache information that can take time for libdft functions to find. Cached info is stored in sorted arrays so that we can bisect to find the target info.

Have you considered using a hash table? I have a feeling it would be even faster.

Why do you think a hash table would speed up the process? We would still need to parse the hash table to find our target.


Fix build error when CFG_EMBED_DTB is disabled.

Signed-off-by: Etienne Carriere <etienne.carriere@foss.st.com>
@jforissier
Contributor

Parse the embedded DTB prior using it and cache information that can take time for libdft functions to find. Cached info is stored in sorted arrays so that we can bisect to find the target info.

Have you considered using a hash table? I have a feeling it would be even faster.

Why do you think a hash table would speed up the process? We would still need to parse the hash table to find our target.

Ah yes, this is about speeding up the first lookup... So I'm not so sure. We are comparing:

  • [Bisecting] Parse the DT, for each value, copy the value into an array; then sort the array and bisect to find a value.
  • [Hashing] Parse the DT, for each value, apply a hash function, copy the hash and the value into an array of linked lists (buckets), then to find a value apply the hash function and read from the appropriate linked list.

With many lookups I believe the hash wins (especially with many values and a large enough hash table to have few collisions), but with only one lookup it's not obvious.

@etienne-lms
Contributor Author

[Hashing] Parse the DT, for each value, apply a hash function, copy the hash and the value into an array of linked lists (buckets), then to find a value apply the hash function and read from the appropriate linked list.

Parsing a list to find a hash value seems less efficient (to me) than an array bisect operation. I must have missed something.

@jforissier
Contributor

[Hashing] Parse the DT, for each value, apply a hash function, copy the hash and the value into an array of linked lists (buckets), then to find a value apply the hash function and read from the appropriate linked list.

Parsing a list to find a hash value seems less efficient (to me) than an array bisect operation. I must have missed something.

Hopefully there is no collision and in this case you reach the value you are looking for immediately (that is, the linked lists have a single element). A linked list is one possible implementation to address collisions: https://en.wikipedia.org/wiki/Hash_table#Separate_chaining


@jenswi-linaro jenswi-linaro left a comment


Comment for "libutils: add bisect helper functions", this looks very much like the standard function bsearch(). Can we import bsearch() from newlib instead?

@etienne-lms
Contributor Author

Thanks for the pointer. I was looking for an open-source bisect implementation but failed to find a compliant one. I missed looking into the newlib source tree. I'll remove my implementation in favor of this far more mature one.


@jenswi-linaro jenswi-linaro left a comment


Comments for "core: dt: cache embedded DTB node parent information"


cuint = fdt_getprop(fdt, node_offset, "#address-cells", NULL);
if (cuint)
addr_cells = (int)fdt32_to_cpu(*cuint);
Contributor


Why is the (int) cast needed?
If #address-cells isn't found, shouldn't the value used in the parent node be used instead?
Same below for #size-cells.

Contributor Author


I see that libfdt does not go one parent level up to get this information. It would make sense, but I don't think this is how DT support is implemented.

parent_node_cache.alloced_count = 0;
}

static TEE_Result enlarge_parent_node_cache(void)
Contributor


grow_parent_node_cache()?

Contributor Author


ok

static struct {
struct parent_node_cache *array;
size_t count;
size_t alloced_count;
Contributor


Is this field needed? It seems that count is always increased right after alloced_count has been increased.

Contributor Author


I allocate the structs by chunks of 64 instances, for some efficiency. Therefore I count the allocated and filled cells separately.

* @address_cells: Parent node #address-cells property value or 0
* @size_cells: Parent node #size-cells property value or 0
*/
struct parent_node_cache {
Contributor


I find the parent part in the name confusing since the struct describes a node. How about struct cached_node?

Contributor Author


The struct is intended to relate to the parent node information of a node, hence this name.
I'll fix that to use a common struct for both parent node info and node phandle info. I'll use cached_node.

@jenswi-linaro
Contributor

Please drop the Change-Id: tags.

@etienne-lms
Contributor Author

I've tested use of a hash table, using a simplistic hash algo (some XOR on offset/phandle value bytes). It gives the same boot performance on the platforms I've tested, and only adds some extra heap consumption.

By the way, I've also tested a few other schemes and found that on my platforms, with at most several hundred nodes in the DT (I tested a worst case with one thousand), scanning a list or an array of the preloaded structures, without any bisect or hash index optimization, gives overall the same results. All in all, I wonder if bisect or hash indices are really useful here, given the relatively small size of the database (hundreds of nodes in the DTB).

I think I should simplify this to the minimum, even if it is less elegant than using hashed or sorted arrays.

@jforissier
Contributor

@etienne-lms that's interesting. If there is no clear benefit in performance I would also think the simplest algorithm is the best.

@jenswi-linaro
Contributor

So you mean that not caching is as fast as caching?

@etienne-lms
Contributor Author

No, I meant that parsing an array or list of cached info is as fast as finding the cached info using bisect or a hash table.

@jenswi-linaro
Contributor

OK, because I suspect there's some overhead in collecting the list.

@etienne-lms
Contributor Author

Actually the overhead for bisect or hashing is very small. I've removed these optimizations because they don't add much value (until a platform comes with a DTB with several thousands of nodes) and cost some more memory. We'll be able to integrate such optimizations if needed in the future.

Since the changes needed to update this series are quite big and revert commits, I preferred to create a new PR instead of appending fixup commits. Please have a look at #7073. Consequently, I close this PR.

@etienne-lms etienne-lms closed this Oct 9, 2024