limit number of records #22

Open
carlosMorell opened this issue Feb 11, 2024 · 0 comments

I have a problem reading Parquet files with codename/parquet/helper/ParquetDataIterator. Small files are read without problems, but once a file has more than 500 records, memory is exhausted and the whole process fails.

I have read the documentation and run thousands of tests with both ParquetDataIterator and ParquetReader, but it always fails with Parquet files of that size, even when I use a bigger machine with more memory.

I also tried to read only X rows, but I can't get it to work and there is no documentation about it.

Is there a way to use ParquetDataIterator or ParquetReader while limiting the number of records? I would like to request them in order, N records at a time, without having to load everything into memory.

I am currently using the following code:

1) For ParquetReader:

```php
use jocoon\parquet\ParquetReader;

// open file stream (in this example for reading only)
$fileStream = fopen(__DIR__.'/test.parquet', 'r');

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for ($i = 0; $i < $parquetReader->getRowGroupCount(); $i++) {
    // create row group reader
    $groupReader = $parquetReader->OpenRowGroupReader($i);

    // read all columns inside each row group (you have an option to read
    // only the required columns if you need to)
    $columns = [];
    foreach ($dataFields as $field) {
        $columns[] = $groupReader->ReadColumn($field);
    }

    // get first column, for instance
    $firstColumn = $columns[0];

    // getData() returns a typed array of column data you can cast to the type of the column
    $data = $firstColumn->getData();

    // print data or do other stuff with it
    print_r($data);
}
```

2) For ParquetDataIterator:

```php
use codename\parquet\helper\ParquetDataIterator;

$iterateMe = ParquetDataIterator::fromFile('your-parquet-file.parquet');

foreach ($iterateMe as $dataset) {
    // $dataset is an associative array
    // and already combines data of all columns
    // back to a row-like structure
}
```
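
To make the request concrete, here is a minimal sketch of the access pattern I am after, written with plain PHP only. It does not use the parquet library at all: `fakeRows()` and `inBatches()` are hypothetical helpers invented for this illustration, and the generator merely stands in for a row iterator such as ParquetDataIterator. `LimitIterator` is PHP's built-in SPL class. Whether wrapping the real iterator this way would actually keep memory bounded depends on how it loads row groups internally, which is exactly the part I could not find documented.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a row iterator such as ParquetDataIterator:
 * yields one associative array per row without building the full list.
 */
function fakeRows(int $total): Generator
{
    for ($i = 0; $i < $total; $i++) {
        yield ['id' => $i, 'name' => "row-$i"];
    }
}

/**
 * Hypothetical helper: walk any iterable once and hand out batches of at
 * most $size rows. Only the current batch is held by this function.
 */
function inBatches(iterable $rows, int $size): Generator
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // trailing partial batch
    }
}

// Case 1: take only the first 5 rows (offset 0, limit 5).
$firstFive = iterator_to_array(new LimitIterator(fakeRows(1200), 0, 5), false);

// Case 2: process all 1200 rows, 500 at a time.
$batchSizes = [];
foreach (inBatches(fakeRows(1200), 500) as $batch) {
    $batchSizes[] = count($batch);
}
```

With the stand-in data, the first case yields 5 rows and the second yields batches of 500, 500 and 200 rows. What I need is the same behaviour on top of a real Parquet file.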
