
TFIDF.transform_many() fails on DataFrame input #1576

Open
bdewilde opened this issue Jul 16, 2024 · 2 comments

Comments

@bdewilde

Versions

river version: 0.21.2
Python version: 3.11.7
Operating system: macOS 14.4

Describe the bug

The TFIDF feature extractor claims to support both online and mini-batch transformations, but the mini-batch case only works when the transformer doesn't specify the on parameter. In other words, batch mode works for pd.Series input, but not for pd.DataFrame input.

Steps/code to reproduce

import pandas as pd
import river.feature_extraction

model = river.feature_extraction.TFIDF()
X = pd.Series(["foo bar bat baz", "foo bar spam eggs"])
for rec in X:
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# WORKS
#    foo  bar  bat  baz  spam  eggs
# 0    1    1    1    1     0     0
# 1    1    1    0    0     1     1

model = river.feature_extraction.TFIDF(on="text")
X = pd.DataFrame([{"text": "foo bar bat baz"}, {"text": "foo bar spam eggs"}])
for rec in X.to_dict(orient="records"):
    print(model.transform_one(rec))
# WORKS
# {'foo': 0.5, 'bar': 0.5, 'bat': 0.5, 'baz': 0.5}
# {'foo': 0.5, 'bar': 0.5, 'spam': 0.5, 'eggs': 0.5}
print(model.clone().transform_many(X))
# DOES NOT WORK

That last call produces the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[95], line 1
----> 1 print(model.clone().transform_many(X))

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:349, in BagOfWords.transform_many(self, X)
    347 for d in X:
    348     t: int
--> 349     for t, f in collections.Counter(self.process_text(d)).items():
    350         indices.append(index.setdefault(t, len(index)))
    351         data.append(f)

File ~/.pyenv/versions/3.11.7/envs/ds/lib/python3.11/site-packages/river/feature_extraction/vectorize.py:220, in VectorizerMixin.process_text(self, x)
    218 def process_text(self, x):
    219     for step in self.processing_steps:
--> 220         x = step(x)
    221     return x

TypeError: string indices must be integers, not 'str'
@bdewilde (Author)

Well, I looked at the source code. Unfortunately, the problems go deeper than inconsistent type annotations:

While TFIDF implements its own .learn_one() and .transform_one() methods, which look to be doing the right thing, it inherits .learn_many() and .transform_many() from its parent BagOfWords class. BagOfWords.learn_many() is a no-op because BagOfWords has nothing to learn, but TFIDF certainly does: it needs to update its document counter and document frequencies. And BagOfWords.transform_many() just builds a bag of words with integer counts from each record, so no term-frequency / inverse-document-frequency weighting is applied, as would be expected from a TFIDF transformer.

I'm going to side-step all of this for now by just iterating over the records one at a time, since both of the inherited mini-batch operations are invalid for TFIDF. If I find some spare time, I'll see about submitting a PR to fix this.

@bdewilde (Author)

bdewilde commented Jul 19, 2024

I took a shot at implementing learn/transform many for the TFIDF transformer. Transformed data / predictions are quite similar when using these .*_many() methods over a batch compared to their .*_one() equivalents in a stream, so I think they're doing the correct calculations. I'm less sure about whether these implementations are "good" and/or in line with the standards followed elsewhere in this lib. At the very least, here's what it could look like:

import pandas as pd
import river.feature_extraction
import scipy.sparse


class MyTFIDF(river.feature_extraction.TFIDF):
    def learn_many(self, X: pd.Series) -> None:
        # increment the global document counter
        self.n += X.shape[0]
        # count the number of documents in which each term appears
        doc_counts = (
            X.map(lambda x: set(self.process_text(x)))
            .explode()
            .value_counts()
            .to_dict()
        )
        # increment existing document frequencies rather than overwriting
        # them, so repeated calls to learn_many() accumulate correctly
        for term, count in doc_counts.items():
            self.dfs[term] = self.dfs.get(term, 0) + int(count)

    def transform_many(self, X: pd.Series) -> pd.DataFrame:
        """Transform a pandas Series of strings into a sparse tf-idf DataFrame."""
        indptr, indices, data = [0], [], []
        index: dict[str, int] = {}
        for doc in X:
            term_weights: dict[str, float] = self.transform_one(doc)
            for term, weight in term_weights.items():
                indices.append(index.setdefault(term, len(index)))
                data.append(weight)
            indptr.append(len(data))

        return pd.DataFrame.sparse.from_spmatrix(
            scipy.sparse.csr_matrix((data, indices, indptr)),
            index=X.index,
            columns=list(index),
        )
