[python-package] Adding support for polars for input data #6204
Thanks for using LightGBM and for taking the time to write this up. I support this. Are you interested in contributing it?
As a side note, passing data from polars without copying any data is the entire reason my PRs exist 😄 polars has a `to_arrow()` method for this.
amazing!!! In the future, please share that context with us when proposing changes here. It helps us to make informed choices about how much time and energy to put into reviewing, and how to weigh the costs of a new addition against the added maintenance burden. @borchero are you interested in adding this?
Sure, will do next time 😄 since there already was an open Arrow PR, I didn't think about going into "motivation" too much 👀
The plan to pass polars data to Arrow in our company's application code would have simply been to call `.to_arrow()`.
🙏🏼🙏🏼
The way I see it, 2024 will be the year of polars adoption by major Python ML packages. The easier you make it for users to use polars, the better the overall user experience will be. I am glad to hear that this is being considered and that my issue wasn't rejected at first glance.
On a different note, I tried to use LightGBM directly in Rust (https://github.com/detrin/lightgbm-rust) and I may use it as a test case. The pyarrow option is interesting; I will try it as well. Thanks @borchero, could you link the PR here?
This is a great point @borchero. Taking on a new dependency is a real cost, and I'm not that familiar with polars. If it's literally just a matter of calling `.to_arrow()`, then as a first test I'd want to understand how that conversion behaves. Because if it's a copy... then having first-class polars support wouldn't save much memory. Consider something like this (pseudo-code, this won't run):

```python
import polars as pl
import lightgbm as lgb

df = pl.read_csv("data.csv")
dtrain = lgb.Dataset(
    df[["feature_1", "feature_2"]],
    label=df["label"]
)
lgb.train(train_set=dtrain, ...)
```

I think that'd result in higher peak memory usage than instead doing something like the following and passing in Arrow data directly:

```python
import polars as pl
import lightgbm as lgb

df = pl.read_csv("data.csv").to_arrow()
dtrain = lgb.Dataset(
    df[["feature_1", "feature_2"]],
    label=df["label"]
)
lgb.train(train_set=dtrain, ...)
```

Other things that might justify adding support for directly passing polars data frames:
Polars internally keeps memory according to the Arrow memory format. When you call `to_arrow()`, that memory is exposed as-is. Moving data in and out of polars via Arrow is zero-copy. Moving data in and out of polars via NumPy can be zero-copy (it depends on the data type, null data and dimensions).
Does this imply that LightGBM could potentially use it even in my snippet above without allocating any new memory on the heap?
@detrin not for your training data, i.e. not for polars data frames. Polars uses column-wise storage: each of your columns is represented by a contiguous chunk of memory (but data for each column is potentially in a different location on the heap). The only interface currently available to pass data to LightGBM from Python is via NumPy (disregarding the option to pass data as files), which uses a single array (= a single chunk of memory) with row-wise storage. This means that each row is represented by a contiguous chunk of memory, and rows are furthermore concatenated such that you end up with a single array.

As you can see, Polars' data representation is quite different from the NumPy data representation and, thus, data needs to be copied. Side note: to not require two copies, you should call `to_arrow()` first.

The way to resolve this issue is to extend LightGBM's API, i.e. to allow other data formats to be passed from the Python API. Arrow is a natural choice since it is being used ever more widely and is the backing memory format for pandas. In fact, it allows you to pass data from any tool that provides data as Arrow to LightGBM without copying any data.
This is not true. The Python package supports all of these formats for raw data:
Start reading from here and you can see all those options: LightGBM/python-package/lightgbm/basic.py Line 2010 in 516bde9
Also for reference https://numpy.org/doc/1.21/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray. It says
"stored in a contiguous block of memory in row-major order" is not exactly the same as "row-wise"; I just wanted to add that link as it's a great reference for thinking about these concepts.
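To make the layout distinction concrete, NumPy's stride metadata shows the difference between row-major storage and the column-contiguous layout that Arrow/polars columns use (a small illustration, not LightGBM code):

```python
import numpy as np

# Row-major (C order): each row is contiguous; this is the layout
# LightGBM's NumPy code path works with.
row_major = np.zeros((3, 2), dtype=np.float64, order="C")

# Column-major (F order): each column is contiguous, analogous to how
# Arrow/polars store one buffer per column.
col_major = np.asfortranarray(row_major)

# Strides = bytes to step along each axis; they reveal the layout.
assert row_major.strides == (16, 8)  # 16 bytes to next row, 8 to next column
assert col_major.strides == (8, 24)  # 8 bytes down a column, 24 to next column
```

Converting between the two layouts necessarily copies, which is why passing a polars frame through NumPy cannot be zero-copy in general.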
Ah sorry about the misunderstanding! I phrased this a little too loosely. I meant data formats that are useful for passing a polars dataframe. Pandas is essentially treated the same as NumPy but carries a bit more metadata. The other options are largely irrelevant for this particular case.
Yep, thanks! Once one has read through the NumPy docs, one also understands the statement that polars' NumPy interop can sometimes be zero-copy.
So, if I understand it correctly, as of now there is no way to pass data from polars to LightGBM without copying the data in memory. For the project I am working on, I might use the CLI as a workaround.
Yes, you will have (at least) one copy of your data in memory along with the LightGBM-internal representation of your data that is optimized for building the tree.
Potentially, a viable temporary alternative might also be to pass data via files(?)
Is it possible directly in Python? I could then write the data to a temp file and load it in Python with LightGBM.
See @jameslamb's comment above for LightGBM's "API":
You could e.g. write your data to CSV. Obviously, this introduces some performance hit.
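A minimal sketch of that temp-file workaround using only the standard library; the `lightgbm.Dataset(path)` call is commented out since it requires lightgbm (`Dataset` does accept a path to a data file):

```python
import csv
import os
import tempfile

# Write feature columns plus label to a temporary CSV.
rows = [(1.0, 10.0, 0), (2.0, 20.0, 1), (3.0, 30.0, 1)]
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    csv.writer(f).writerows(rows)
    path = f.name

# dtrain = lightgbm.Dataset(path)  # would load the file from disk
assert os.path.getsize(path) > 0
os.unlink(path)
```

This trades heap memory for disk I/O, which is the "performance hit" mentioned above.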
Shameless plug:
@jameslamb I just thought again about adding documentation about how to pass polars data frames to LightGBM. I thought about adding a note in the docs.
Hey, thanks for reviving this @borchero. A lot has changed since the initial discussion. There's now an Arrow code path in the Python package. I wonder... could we add direct, transparent support for polars with a duck-typing check like this?

```python
def _is_polars(arr) -> bool:
    return "polars." in str(arr.__class__) and callable(getattr(arr, "to_arrow", None))

# ... in any method where LightGBM accepts raw data ...
if _is_polars(data):
    data = data.to_arrow()
```

Because if we did that, then we wouldn't need to document specific methods that have to be called on polars objects before passing them to LightGBM.
This related discussion is also worth cross-linking: dmlc/xgboost#10554
Hi @jameslamb, hope it's OK for me to jump in here. I contribute to pandas and Polars, and have fixed up several issues related to the interchange protocol mentioned in dmlc/xgboost#10452. The interchange protocol provides a standardised way of converting between dataframe libraries, but has several limitations which may affect you, so I recommend not using it:
If all you need to do is convert to pyarrow, then I'd suggest you just do:

```python
if (pl := sys.modules.get("polars")) is not None and isinstance(data, pl.DataFrame):
    data = data.to_arrow()
```

If instead you need to perform dataframe operations in a library-agnostic manner, then Narwhals, an extremely lightweight compatibility layer between dataframe libraries which has zero dependencies, may be of interest to you (Altair recently adopted it for this purpose, see vega/altair#3452, as did scikit-lego). I'd be interested to see how I could be of help here, as I'd love to see Polars support in LightGBM happen 😍 - if it may be useful to have a longer chat about how it would work, feel free to book some time at https://calendly.com/marcogorelli
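Wrapped in a helper, the `sys.modules` pattern degrades gracefully when polars was never imported (`maybe_to_arrow` is a hypothetical name, not LightGBM API):

```python
import sys

def maybe_to_arrow(data):
    # Only touch polars if the *caller* already imported it; this avoids
    # making polars a hard dependency or importing it eagerly.
    if (pl := sys.modules.get("polars")) is not None and isinstance(data, pl.DataFrame):
        return data.to_arrow()
    return data

# Non-polars inputs pass through untouched.
assert maybe_to_arrow([1, 2, 3]) == [1, 2, 3]
```

Note the parenthesisation around the walrus assignment: `(pl := sys.modules.get("polars")) is not None` binds `pl` before the `is not None` test.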
@ritchie46 pointed out this discussion to me, and I wanted to highlight recent work around the Arrow PyCapsule Interface. It's a way for Python packages to exchange Arrow data safely without prior knowledge of each other. If the input object has an `__arrow_c_stream__` method, you can consume it directly. You can use:

```python
import pyarrow as pa

assert hasattr(input_obj, "__arrow_c_stream__")
table = pa.table(input_obj)
# pass table into existing API
```

Alternatively, this also presents an opportunity to access a stream of Arrow data without materializing it in memory all at once. You can use the following to only materialize a single Arrow batch at a time:

```python
import pyarrow as pa

assert hasattr(input_obj, "__arrow_c_stream__")
reader = pa.RecordBatchReader.from_stream(input_obj)
for arrow_chunk in reader:
    ...  # do something with each batch
```

If the source is itself a stream (as opposed to a "table" already materialized in memory), this avoids ever holding the full dataset at once. The PyCapsule Interface could also let you remove the pyarrow dependency if you so desired.
Thank you both for the information! I'm definitely supportive of having polars support. Both approaches mentioned above (@MarcoGorelli's and @kylebarron's) look interesting for us to pursue here as a first step. Hopefully someone will be able to look into this soon (maybe @borchero or @jmoralez are interested?). We'd also welcome contributions from anyone involved in this thread who has the time and interest, and would be happy to talk through specifics over a draft PR. We have a lot more work to do here than hands to do it... we haven't even added pyarrow support everywhere in the Python package.

For a bit more context... the main place where raw-data handling happens is in `basic.py`. If you want to see what I mean by that, some of the relevant code is here:

LightGBM/python-package/lightgbm/basic.py Line 807 in d67ecf9
LightGBM/python-package/lightgbm/basic.py Line 855 in d67ecf9
LightGBM/python-package/lightgbm/basic.py Line 867 in d67ecf9
Thanks for your response 🙏! Quick note on the two approaches: Narwhals exposes the PyCapsule Interface (narwhals-dev/narwhals#786) and includes a fallback to PyArrow (thanks Kyle for having suggested it! 🙌), which arguably makes it easier to use than looking for `__arrow_c_stream__` manually.
Summary
I think the polars library is on the path to replacing the majority of pandas use-cases. It is already being adopted by the community. We use it internally at my company for new projects, and we try not to use pandas at all.
Motivation
Polars is blazingly fast and has a several times lower memory footprint. There should be no need to use extra memory converting data into numpy or pandas for training in LightGBM.
Description
I would like the following to work, where `data_train` and `data_test` are instances of `pl.DataFrame`; as of now I have to convert them into numpy matrices.