-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compilation error for navi10 (use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD') #775
Comments
I ended up modifying // buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx908__) || \
defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || \
defined(__gfx942__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx1030__) || defined(__gfx1010__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__) // for GPU code
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31004000
#endif Then, the build was successful. |
@TyraVex Thanks for reporting this issue. Please let us know if it works or not. We do not have Navi10 GPUs to test it. |
Sorry for the late response. Here are the test results
|
In the There is missing support for gfx941, gfx942, gfx1012 and gfx1030 Folks will be unhappily surprised when MIOpen and similar users of CK fail. |
Our experiments at Solus have found the following:
|
@GZGavinZhao Its interesting that the So if I could just get to understand that we are hitting a hardware limit with gfx1010 (RX 5700 and the likes, Navi10 processor) that it can not do what the gfx1030 (RX6900 and the likes Navi21,23) is doing or its just someones idea for business strategy that gfx1010 is too old to support or whats the case? if its a hardware limitation where did we hit a dead end? why the software is not shipped as well as for gfx1010, im just trying to understand what is going on, whos deciding the stuff? I know there have been a tremendous amount of efforts all over the place which i was not part of, and I appreciate everyone contribution to where things have gone so far and very thankful to everyone's effort but we need to fix this issue, old GPUS are not going to the trash man (if the plan is to send them to the trash, then too bad of a waste)! I just read some main instruction set parts for memory shader and some other registers https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna-shader-instruction-set-architecture.pdf both AMD-GCN-GFX10-RDNA2 and AMD-GCN-GFX10-RDNA1 seems to have an identical instructions sets, I understand there are some details i missed, but from a hardware point of view, everything seems to be existing already in gfx1010 so our problem is software, so probably just copying the gfx1030 stuff and use it for gfx1010 and maybe without modifying anything (or with modifiying few things which is a nightmare:)) we would get a better support for that Navi10 (that gddr6 monster can do crazy stuff, so people are using older software by adopting to the gfx900 arch, well its still a toy in comparison to the MI300X) the problem is that this is not an easy job to do, it should have been considered earlier but we can maybe still do something all together, as the projects involved maybe beyond my capabilities, like: 1-LLVM |
I just hit this error while trying to add the gfx1030 features to gfx1010,
which is expected, as I modified the AMDGPU Target for that arch, so now my question, which one is the cached asm caps and which one derived from where? I passed through the error by forking the Tensile repo and IgnoringAsmCap mismatch |
Could you raise an issue in Tensile for this ? |
Thanks @trixirt , yes, I will also add this issue to it: below is actually the compileArgs command, and the first element is None, which i guess is the cxx_compiler(it should not be None):
|
I agree this issue/discussion should continue in Tensile, but just to add my 2 cents:
RDNA1 and RDNA2 are not the same instruction set. For example, consult section 6.3 of the RDNA1 and RDNA2 ISA architecture. You will see that RDNA2 has
As per my reasoning above, in my understanding it is a hardware limit that Therefore, any attempt at trying to compile and run code intended for RDNA2 on RDNA1 will eventually fail. The best we can do (and what it's already been done at Solus) is to patch libraries like rocBLAS, Tensile, and CK so that they can compile and run RDNA1 code on a best-effort basis. I believe Solus's support for RDNA1 hardware is complete. I assume you own a RDNA1 hardware, so feel free to grab a Solus ISO, install the |
Thanks for having a look @GZGavinZhao appreciate all your efforts.
Actually regarding the above I have a feeling that we can build these instructions for the gfx1010, you referred to the VOP3P instruction which can also be built for that target, it could be all what im doing is none sense, but I like trying stuff and let me see where that takes me :) I will keep you posted either way.
i think Solus is definitely helpful but currently running the gfx1010 on gfx900 is a downgrade. It would be my last resort to do that nevertheless I really like what you guys are doing at Solus looks a good approach for a better GPU support, I have plenty of GPUs as I work as a repairman for myself. And thus i ended up with plenty of Graphic cards from all generations. So i need to be fair to myself :) |
Note that only for CK, I do agree that for CK specifically, we may be able to treat |
Yes I noticed that, and that what brought me here :) I think there is a room of improvement in general :) the repos are huge and there are a lot of stuff like 20 repos :D |
Right now I'm just reading some weird stuff like this: |
That file is LLVM TableGen. The problem is again that By saying "then workaround specific cases where VDOT instructions are used", I meant instead of relying on hardware-accelerated instructions, write portable HIP code that are the equivalent of the VDOT operation. |
I think I understand what you mean, but you did not understand what I mean :) It worked! right now here is my results for building for ck gfx1030;gfx1010;gfx803 with all develop branches using llvm19 and clang19 a fresh make looks like this:
I guess now we will fail in some areas which we need to look at more closely ;) I have no idea how bad is this but it looks good for my eyes
I thank the Lord of Skies and Earth for what he gives and teaches and I also thank everyone for his help and contribution. we need more documentations, I do not even know how to run the ckPorfiler man. |
Really nice! I'm curious, what did you change? Did you only update CK, or did you also update the TableGen files in LLVM as well? |
Thanks! I actually changed few things in the target code for that, and only made one change in ck, cmakelists to treat gfx1010 as gfx1030, but now I would need to continue the process, I think I will leave it here for now, but at least you know now, that this is not impossible as you thought earlier but rather possible , even it took a lot of my energy, but I need to provide better quality service for my future customers at https://ropotato.com I think there will be more work to be done generally, but at least we have progress. |
Interesting. By "impossible", I meant it's impossible to compile CK as |
No man, gfx1010 is not run-able with these changes, do you get it now?
unless you modify the stuff like i did, but you keep changing what you said, and i dont like that, I will quit the discussion |
My apologies and I think there are misunderstandings from the start of this conversation. In my first response to you, I said
This still holds. If you compile code intended for RDNA2 ( What you seem to be trying to do is to make CK somehow run on RDNA1. This is a solved problem, see the diff below. As to why the changes you posted don't work, I never said that the changes you posted were the ones needed to make it work. For ROCm 6.0, all you need is a simple diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..32551a8da 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx103__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -86,7 +86,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// FMA instruction
#ifndef __HIP_DEVICE_COMPILE__ // for host code, define nothing
-#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx101__) // for GPU code
#define CK_USE_AMD_V_MAC_F32
#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
#define CK_USE_AMD_V_FMAC_F32 If you want to be really precise on using all the optimizations available, it would be this: diff --git a/include/ck/ck.hpp b/include/ck/ck.hpp
index 9528a30b4..4eaeefdae 100644
--- a/include/ck/ck.hpp
+++ b/include/ck/ck.hpp
@@ -76,7 +76,7 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
// buffer resource
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
-#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__)
+#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx9__) || defined(__gfx101__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx103__)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
@@ -88,10 +88,12 @@ CK_DECLARE_ENV_VAR_BOOL(CK_LOGGING)
#ifndef __HIP_DEVICE_COMPILE__ // for host code, define nothing
#elif defined(__gfx803__) || defined(__gfx900__) // for GPU code
#define CK_USE_AMD_V_MAC_F32
-#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) // for GPU code
+#elif defined(__gfx906__) || defined(__gfx9__) || defined(__gfx103__) || defined(__gfx1011__) || defined(__gfx1012__) // for GPU code
#define CK_USE_AMD_V_FMAC_F32
#define CK_USE_AMD_V_DOT2_F32_F16
#define CK_USE_AMD_V_DOT4_I32_I8
+#elif defined(__gfx101__)
+#define CK_USE_AMD_V_MAC_F32
#elif defined(__gfx11__) || defined(__gfx12__)
#define CK_USE_AMD_V_FMAC_F32
#define CK_USE_AMD_V_DOT2_F32_F16 No changes in LLVM should be necessary. You can do everything in CK. The only concern I have is I don't understand how |
No problem at all @GZGavinZhao I think I understand your point of view now, I much appreciate all your good work out there man, dont get me wrong and have a pleasant time. All my best, |
Ok I found out that I messed up few things, also the diff above is missing on additional file:
depending what you want to do, if you want to go a bit deeper than a shallow clone then you would need to have a look and maybe modify few things over there,
The correct I hope that helps |
If you're targeting the
Would you mind telling me where did you find this information? I assume you're referencing the RDNA ISA architectures here? If so, could you please tell me the section number of the Instruction Set Architecture reference guide in which you found this information? I agree that it's highly likely |
Well, its good to find bugs, or issues as early as possible, so I'm trying to help myself and possibly others. So I would go for the amd-staging as I'm also in development mode and I need to change stuff anyway. But I get your point its definitely much easier to deal with a released version because then you may not need to build the whole stack (big difference of time for sure)
I got the information from the RDNA1 and RDNA2 architecture you posted above, obviously I'm not guessing, and FYI both have the same section number: Ok I obviously did not read the last sentence in your reply, so updating: 8.1.8 clarifies the BUFFER_RESOURCE 128-bit structure which are identical in both architectures and the 3RD_DWORD is derived from that, which are equivalent (of course I assumed few things are also equivalent, and that might not be true but the closest to it). Adding more explanation: So, the third DWORD (bits 96-127) contains the following fields:
So I may be totally incorrect in my assumptions but also I maybe correct in all the assumptions except for the Num_records Thats it for Today :) |
and here more detailed documentation, this one was hidden :D |
Ok I'm partially done, here is my results, now running on all latest develop as of two days ago :) i think the performance is not bad, but I dont know how it compares to the HSA_OVERRIDE to 1030, if anyone have the gfx1010 running on the latest official rocm versions I would like to see your results (7.3 seconds for a standard no-half generation on SD) : The torch version is actually main as of yesterday so its 2.5.rc0 I think the 2.3.1 is just the number i choose to let pip3 manage the stuff correctly i think there is still high disk latency as I'm using a physical hdd from 2009 for this test machine :D |
I applied patches to the ROCm 6.2 release to make composable kernels compile for the 5500 XT (gfx1012) as described, as well as unit tests for verification. It seems to work, but there's a lot more fine-tuning to do, since several tests failed the check.
If you want more information about this, or to see the output of these tests, ask away. |
UPDATE: The failed test results from CK have been compiled with HIPCC using GCC 12 headers on the Debian 12 Bookworm Docker image. Recompiling with GCC 14 headers on the most recent Debian Sid point image, only two tests failed. It's a major improvement but likely still unusable for production.
|
@TheTrustedComputer If you could run test with |
which patches did you go for the gfx900 one or the gfx1030 one? |
@GZGavinZhao Here you go. https://drive.google.com/file/d/1GxGLxHaJagvSSTukjjLzZMZs1UgtcOEE/view Above is a link to the verbose output of the two specific tests that failed, as it may be too long to post to GitHub. It looks like some of the four elements in these tests are returning zero instead of expecting non-zero output. Regardless, I hope you find it helpful in diagnosing whether it's a false positive or a legitimate one. If it's the former, perhaps you can modify these tests for cards with less VRAM? To clarify, I'm using the 8GB 5500 XT model and not the 4GB one. @waheedi I went for the gfx1030 patch to include the |
@TheTrustedComputer I'm glad you did that :) |
Here's my rough attempt at adding composable kernels to the gfx1010 family. It seems I faced the exact issue as ROCm/MIOpen#1528, but on a gfx1012 instead of the OP's gfx1031. So I followed ROCm/MIOpen@74e496d to hopefully get CK and MIOpen working together on RDNA1. However, it appears I already built and installed MIOpen without patching beforehand; the error message is identical as well. Also, I had to endure nearly 17 hours compiling CK for the gfx1012.
|
UPDATE: After patching |
@TheTrustedComputer ane progress on that? Do you plan on doing a pull request or something like that? |
@SzczurekYT Since RDNA1 GPUs are not officially supported, I have no plans to create an upstream pull request. However, if you wish, I can provide a diff file you can patch automatically with git. This will patch both composable kernels and MIOpen to introduce basic hardware compatibility for this family of cards. Compile for the architecture of your GPU. In your case, a 5600 XT is a gfx1011. Be prepared to wait several hours for it to complete. |
I don't think the architecture not being officially supoorted is a problem. The way I understand it is that the officially supported gpus are guaranteed to work, but that doesn't ban the community from submitting patches for other architectures.
That would be appreciated.
Interesting, as rocminfo says gfx1010
I can wait. |
Hello, I have some trouble to compile composable_kernel for my AMD GPU architecture (gfx1010)
Any ideas about a solution?
The text was updated successfully, but these errors were encountered: