
Segmentation Fault when Applying a Gradient Update on TransferCompound on CUDA in Debug Mode #632

Open
Zhaoxian-Wu opened this issue Mar 27, 2024 · 3 comments
Assignees
Labels
bug Something isn't working status:reviewing

Comments

@Zhaoxian-Wu

Description

When I run optimizer.step() on a TransferCompound device with the library compiled in debug mode, a segmentation fault occurs. It happens with the CUDA version.

How to reproduce

I followed these steps:

  1. Compile the code in debug mode:
conda create -n aihwkit-cuda-dev python=3.10 -y
conda activate aihwkit-cuda-dev

git clone https://github.com/IBM/aihwkit.git ; cd aihwkit
pip install -r requirements.txt
conda install mkl mkl-include -y

export CXX=/usr/bin/g++
export CC=/usr/bin/gcc
export MKLROOT=$CONDA_PREFIX
export CMAKE_PREFIX_PATH=$CONDA_PREFIX
# export CUDA_VERSION=11.3
export CUDA_VERSION=11.1
export CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
export CUDA_TOOLKIT_ROOT_DIR=${CUDA_HOME}
export CUDA_LIB_PATH=${CUDA_HOME}/lib64
export CUDA_INCLUDE_DIRS=${CUDA_HOME}/include
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
make build_inplace_cuda flags="-DRPU_DEBUG=ON"
  2. Run the Python script main.py (provided below):
(aihwkit-cuda-dev) MrFive@server:~/Desktop/aihwkit$ python main.py 
/home/MrFive/Desktop/aihwkit/./src/aihwkit/__init__.py
RPUSimple<float>(3,2)
rpu.cpp:264 : RPUSimple constructed.
rpu_pulsed.cpp:96 : RPUPulsed constructed
rpu_pulsed.cpp:190 :     BL = 31, A = 1.79605, B = 1.79605
RPUSimple<float>(3,2)
rpu.cpp:341 : RPUSimple copy constructed.
cuda_util.cu:455 : Create context on GPU -1 with shared stream (on id 0)

cuda_util.cu:426 : Init context...
cuda_util.cu:434 : Create context on GPU 0
cuda_util.cu:245 : GET BLAS env.
cuda_util.cu:259 : CUBLAS Host initialized.
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
rpucuda.cu:93 : RPUCudaSimple constructed from RPUSimple on shared stream
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
rpucuda_pulsed.cu:64 : RPUCudaPulsed constructed
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 8, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 96, 512, 96, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 48, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 96, 512, 96, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 48, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1255 : Assign host (hsize,P,W,H): 24, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1226 : Assign host (hsize,P,W,H): 16, 512, 16, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
RPUPulsed<float>[Transfer(2): SoftBoundsReference -> SoftBoundsReference](3,2)
rpu_pulsed.cpp:143 : RPUPulsed DESTRUCTED
rpu.cpp:288 : RPUSimple DESTRUCTED
cuda_util.cu:813 : Get SHARED float buffer ID 0, size 2, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 1, size 3, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1315 : Copy to host (hsize,P,W,H): 4, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 0, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 1, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 0, size 3, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:813 : Get SHARED float buffer ID 1, size 2, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 0, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:831 : Release SHARED float buffer ID 1, stream 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 24, 512, 96, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 24, 512, 96, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 12, 512, 48, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 2, 512, 8, 1
cuda_util.cu:1265 : Assign from CudaArray (S,P,W,H): 4, 512, 16, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1041 : CudaArray copy constructed.
cuda_util.cu:1085 : Set (hsize,P,W,H): 1, 512, 4, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1281 : Assign device (S, P,W,H): 6, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
bit_line_maker.cu:1541 : BLM init BL buffers with batch 1 and BL 31.
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 8, 1
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:651 : Synchronize stream id 0
cuda_util.cu:1085 : Set (hsize,P,W,H): 2, 512, 16, 1
cuda_util.cu:1085 : Set (hsize,P,W,H): 3, 512, 24, 1
cuda_util.cu:651 : Synchronize stream id 0
Segmentation fault (core dumped)

Expected behavior

The code should run without a segmentation fault.

Other information

main.py

import sys
sys.path.insert(0, './src')

import torch
import aihwkit
print(aihwkit.__file__)

# Imports from aihwkit.
from aihwkit.nn import AnalogLinear
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import (
    UnitCellRPUConfig, 
    TransferCompound, 
    SoftBoundsReferenceDevice)


rpu_config = UnitCellRPUConfig(
    device=TransferCompound(
        unit_cell_devices=[
            SoftBoundsReferenceDevice(),
            SoftBoundsReferenceDevice(),
        ]
    )
)

in_dim = 2
out_dim = 3
model = AnalogLinear(in_dim, out_dim, bias=True, rpu_config=rpu_config)

opt = AnalogSGD(model.parameters(), lr=0.1)

x = torch.ones(in_dim)
x = x.cuda()
model.cuda()

opt.zero_grad()
pred = model(x)
loss = pred.norm()**2
loss.backward()
opt.step()
  • PyTorch version: 2.1.2+cu121
  • Package version: 0.8.0
  • OS: Ubuntu 20.04.2
  • Python version: Python 3.10
  • Conda version (or N/A): conda 23.10.0
@Zhaoxian-Wu Zhaoxian-Wu added the bug Something isn't working label Mar 27, 2024
@maljoras
Collaborator

maljoras commented Apr 2, 2024

@Zhaoxian-Wu indeed, debug mode might not be working with Python. The debug mode is only used for C++ environments.

@Borjagodoy
Collaborator

@Zhaoxian-Wu were you able to follow up on @maljoras's response, and are you still having the problem?

@Zhaoxian-Wu
Author

@Zhaoxian-Wu indeed, debug mode might not be working with python. The debug mode is only used for C++ environments.

I see. So what is the best practice for debugging the C++ code? While developing some functionality for the analog tile update, I found I needed to print intermediate results to verify that everything was working correctly. I used the DEBUG_OUT(...) macro left in the C++ code for this, which works only in debug mode. So if I do want to debug, do you have any suggestions for elegantly printing results from the C++ side?

Thanks for your response, and sorry for my late reply.
