Don't include undef sym refs when building map of symbol definitions #629

Open: 2 commits to merge into base: master

andrewjcg
Contributor

Previously, we'd count undefined symbol references in the map of symbols defined in a binary. This could cause py-spy to, for example, treat an undefined reference to `_PyRuntime` in some binary other than libpython.so as the definition.
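
For readers unfamiliar with the failure mode, here is a minimal sketch of the idea, not the actual patch. In ELF, a symbol-table entry whose section index is `SHN_UNDEF` (0) is only a reference to a symbol defined elsewhere, so it should not be recorded as a definition. The `Symbol` struct, the `build_symbol_map` function, and the `load_offset` parameter are hypothetical names chosen for illustration and are assumed not to match py-spy's real types.

```rust
use std::collections::HashMap;

/// ELF section index that marks a symbol as undefined (SHN_UNDEF in the ELF spec).
const SHN_UNDEF: u16 = 0;

/// Hypothetical, simplified symbol-table entry (not py-spy's actual type).
struct Symbol {
    name: &'static str,
    /// Index of the section the symbol is defined in; SHN_UNDEF for a bare reference.
    section_index: u16,
    /// Symbol value: an address for defined symbols, typically 0 for undefined ones.
    value: u64,
}

/// Build a name -> address map, skipping entries that only reference a symbol.
fn build_symbol_map(symbols: &[Symbol], load_offset: u64) -> HashMap<String, u64> {
    let mut map = HashMap::new();
    for sym in symbols {
        // An undefined entry says "this binary uses the symbol", not "defines it".
        if sym.section_index == SHN_UNDEF {
            continue;
        }
        map.insert(sym.name.to_string(), sym.value + load_offset);
    }
    map
}
```

With this filter, a main binary that merely references `_PyRuntime` contributes no entry for it, so a lookup can fall through to the binary that actually defines the symbol.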

andrewjcg added a commit to andrewjcg/py-spy that referenced this pull request Sep 3, 2024
… index

Summary:
Don't count undefined symbols in the index of symbols that py-spy builds.
Counting them can cause e.g. py-spy to misattribute an undefined ref to `_PyRuntime`
in some location other than `libpython.so` as the definition.

Upstreamed as: benfred#629

Test Plan:
Ran on `/packages/cpu.xlformers.train/penv.par`.  Before, we'd die
with:

```
$ RUST_LOG=info ./fbpy-spy dump -p 1162
[2023-10-31T18:04:04.658254536Z INFO  py_spy::config] Command line args: ArgMatches { args: {}, subcommand: Some(SubCommand { id: [hash: B8461C91A07ADDC8], name: "dump", matches: ArgMatches { args: {[hash: CD5160AB4406C427]: MatchedArg { occurs: 1, source: Some(CommandLine), indices: [2], type_id: Some(TypeId { t: 69534013883876418352099503721857626982 }), vals: [[AnyValue { inner: TypeId { t: 69534013883876418352099503721857626982 } }]], raw_vals: [["1162"]], ignore_case: false }}, subcommand: None } }) }
[2023-10-31T18:04:04.660694834Z INFO  py_spy::python_spy] Got virtual memory maps from pid 1162:
[2023-10-31T18:04:07.033385523Z INFO  py_spy::python_spy] Found libpython binary @ /usr/local/fbcode/platform010/lib/libpython3.10.so.1.0
[2023-10-31T18:04:07.038415315Z INFO  py_spy::python_spy] got symbol Py_GetVersion.version (0x00007fa5a425acf0) from libpython binary
[2023-10-31T18:04:07.038425108Z INFO  py_spy::python_spy] Getting version from symbol address
[2023-10-31T18:04:07.039366641Z INFO  py_spy::version] Found matching version string '3.10.9+fb (3.10:1dd9be6, May  4 2022, 01:23:45) [Clang 12.0.1 (mononoke://'
[2023-10-31T18:04:07.039374857Z INFO  py_spy::python_spy] python version 3.10.9 detected
[2023-10-31T18:04:07.039380427Z INFO  py_spy::python_spy] got symbol _PyRuntime (0x000056301cf89000) from python binary
[2023-10-31T18:04:07.039498251Z WARN  py_spy::python_spy] Interpreter address from _PyRuntime symbol is invalid 0000000000000040
[2023-10-31T18:04:07.039503358Z INFO  py_spy::python_spy] Failed to get interp_head from symbols, scanning BSS section from main binary
[2023-10-31T18:04:07.154577459Z INFO  py_spy::python_spy] Failed to get interpreter from binary BSS, scanning libpython BSS
Error: Failed to find a python interpreter in the .data section
```

After:
```
$ RUST_LOG=info ./py-spy dump -p 1162
[2023-10-31T18:04:20.036236603Z INFO  py_spy::config] Command line args: ArgMatches { args: {}, subcommand: Some(SubCommand { id: [hash: B8461C91A07ADDC8], name: "dump", matches: ArgMatches { args: {[hash: CD5160AB4406C427]: MatchedArg { occurs: 1, source: Some(CommandLine), indices: [2], type_id: Some(TypeId { t: 69534013883876418352099503721857626982 }), vals: [[AnyValue { inner: TypeId { t: 69534013883876418352099503721857626982 } }]], raw_vals: [["1162"]], ignore_case: false }}, subcommand: None } }) }
[2023-10-31T18:04:20.038355392Z INFO  py_spy::python_spy] Got virtual memory maps from pid 1162:
[2023-10-31T18:04:22.319161826Z INFO  py_spy::python_spy] Found libpython binary @ /usr/local/fbcode/platform010/lib/libpython3.10.so.1.0
[2023-10-31T18:04:22.323992753Z INFO  py_spy::python_spy] got symbol Py_GetVersion.version (0x00007fa5a425acf0) from libpython binary
[2023-10-31T18:04:22.324001859Z INFO  py_spy::python_spy] Getting version from symbol address
[2023-10-31T18:04:22.324937137Z INFO  py_spy::version] Found matching version string '3.10.9+fb (3.10:1dd9be6, May  4 2022, 01:23:45) [Clang 12.0.1 (mononoke://'
[2023-10-31T18:04:22.324946474Z INFO  py_spy::python_spy] python version 3.10.9 detected
[2023-10-31T18:04:22.324951227Z INFO  py_spy::python_spy] got symbol _PyRuntime (0x00007fa5a42531b0) from libpython binary
[2023-10-31T18:04:22.325348234Z INFO  py_spy::python_spy] Found interpreter at 0x00007fa57daea000
[2023-10-31T18:04:22.325352986Z INFO  py_spy::python_spy] got symbol _PyRuntime (0x00007fa5a42531b0) from libpython binary
[2023-10-31T18:04:22.325356193Z INFO  py_spy::python_spy] Found _PyRuntime @ 0x00007fa5a42531b0, getting gilstate.tstate_current from offset 0x238
Process 1162: [xarexec] /packages/cpu.xlformers.train/penv.par -tt /dev/shm/uid-0/894107fb-seed-nspid4026533351_cgpid8628534-ns-4026533348/__run_xar_main__.py --model=genesis220B_kv8 --model.non_linearity=swiglu --model.use_rope=True --model.init.use_gaussian=True --model.init.use_depth=current --model.alpha_depth=disabled --optim.lr=0.00015 --optim.lr_min_ratio=0.1 --optim.warmup=2000 --seq_len=4096 --batch_size=4 --steps=476000 --unlimited_steps=False --log_freq=10 --eval_freq=-1 --profile_freq=-1 --dump_freq=50 --iter_type=multi --fp32_reduce_scatter=False --checkpoint_destination=directio --model_entity_id=-1 --do_checkpoint=True --model_parallel_size=8 --log_all_steps=True --gpu_check_level=-1 --tokenizer_dir=/mnt/wsfuse/tokenizers --periodic_gpu_check=False --data=/mnt/wsfuse/fair_llm_v2/shuffled/stackexchange:0.88,/mnt/wsfuse/fair_llm_v2/shuffled/b3g:3.15,/mnt/wsfuse/fair_llm_v2/shuffled/arxiv:1.14,/mnt/wsfuse/fair_llm_v2/shuffled/github_oss_with_stack:4,/mnt/wsfuse/fair_llm_v2/shuffled/c4/en:6,/mnt/wsfuse/fair_llm_v2/edouard_cc_20220927_new:24.7,/mnt/wsfuse/fair_llm_v2/ccnet_new:28.3,/mnt/wsfuse/fair_llm_v2/shuffled/wikipedia:3.5 --use_libuv=True --model_ckpt_multiplier=1 --optim_ckpt_multiplier=1 --dump_dir=/mnt/wsfuse/outputs/torchx-cpu-xlformers-h514mwh
Python v3.10.9 (/dev/shm/uid-0/894107fb-seed-nspid4026533351_cgpid8628534-ns-4026533348/runtime/bin/train#native-main#platform-runtime#python#py_version_3_10)

Thread 0x7FA5B5B8E000 (active): "MainThread"
    _single_tensor_adamw (torch/optim/adamw.py:466)
    adamw (torch/optim/adamw.py:335)
    step (torch/optim/adamw.py:184)
    _use_grad (torch/optim/optimizer.py:76)
    wrapper (torch/optim/optimizer.py:373)
    wrapper (torch/optim/lr_scheduler.py:68)
    main (train.py:761)
    manifoldfs_main_wrapper (train.py:296)
    inner (contextlib.py:79)
    <module> (train.py:1204)
    _run_code (runpy.py:86)
    _run_module_as_main (runpy.py:196)
    run_as_main (__par__/bootstrap.py:58)
    run_as_main (__par__/meta_only/bootstrap.py:76)
    __invoke_main (__run_xar_main__.py:91)
    <module> (__run_xar_main__.py:140)
Thread 0x7FA55F400000 (idle): "Thread-1"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7FA562C00000 (idle): "Thread-2"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F9FC4600000 (idle): "Thread-3"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F9F85A00000 (idle): "Thread-4"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F9F85000000 (idle): "Thread-5"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 0x7F9F84600000 (idle): "Thread-6"
    wait (threading.py:324)
    get (queue.py:180)
    _run (tensorboard/summary/writer/event_file_writer.py:225)
    run (tensorboard/summary/writer/event_file_writer.py:253)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
```

Reviewers: bmaurer, kunalb, wenyinfu

Reviewed By: bmaurer

Subscribers: mzlee

Differential Revision: https://phabricator.intern.facebook.com/D50847131