feat: implement DataWriter for Iceberg data files #552

Open
shangxinli wants to merge 1 commit into apache:main from shangxinli:implement-data-file-writer


@shangxinli
Contributor

Implements DataWriter class for writing Iceberg data files as part of issue #441 (task 2).

Implementation:

  • Factory method DataWriter::Make() for creating writer instances
  • Support for Parquet and Avro file formats via WriterFactoryRegistry
  • Complete DataFile metadata generation including partition info, column statistics, serialized bounds, and sort order ID
  • Proper lifecycle management with Initialize/Write/Close/Metadata
  • PIMPL idiom for ABI stability

Related to #441
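
For readers unfamiliar with the PIMPL idiom listed above, here is a minimal single-file sketch of how it typically combines with a static factory. It is not the code from this PR: `SketchWriter`, its `Impl`, and the byte-counting behavior are hypothetical stand-ins, and the real `DataWriter` returns `Result<...>` types rather than raw values.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// --- What would live in the public header ---
// Callers see only an opaque Impl pointer, so private members can change
// without altering the size or layout of SketchWriter (hypothetical name).
class SketchWriter {
 public:
  static std::unique_ptr<SketchWriter> Make(std::string path);
  ~SketchWriter();  // defined below, where Impl is a complete type

  void Write(const std::string& row);
  int64_t Length() const;
  const std::string& path() const;

 private:
  class Impl;
  explicit SketchWriter(std::unique_ptr<Impl> impl);
  std::unique_ptr<Impl> impl_;
};

// --- What would live in the .cc file ---
class SketchWriter::Impl {
 public:
  explicit Impl(std::string path) : path_(std::move(path)) {}
  // Stand-in for real file output: only counts the bytes it was given.
  void Write(const std::string& row) {
    bytes_ += static_cast<int64_t>(row.size());
  }
  int64_t Length() const { return bytes_; }
  const std::string& path() const { return path_; }

 private:
  std::string path_;
  int64_t bytes_ = 0;
};

SketchWriter::SketchWriter(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {
  assert(impl_ != nullptr);
}
SketchWriter::~SketchWriter() = default;

std::unique_ptr<SketchWriter> SketchWriter::Make(std::string path) {
  auto impl = std::make_unique<Impl>(std::move(path));
  // The constructor is private, so std::make_unique cannot be used here.
  return std::unique_ptr<SketchWriter>(new SketchWriter(std::move(impl)));
}

void SketchWriter::Write(const std::string& row) { impl_->Write(row); }
int64_t SketchWriter::Length() const { return impl_->Length(); }
const std::string& SketchWriter::path() const { return impl_->path(); }
```

The destructor is declared in the class and defined after `Impl`, because `std::unique_ptr` needs the complete type at the point where it deletes the object.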

@shangxinli shangxinli force-pushed the implement-data-file-writer branch from 8944a75 to a201953 Compare January 31, 2026 17:59

ICEBERG_ASSIGN_OR_RAISE(writer_,
                        WriterFactoryRegistry::Open(options_.format, writer_options));
return {};

Contributor

It is odd that an empty structure is always returned. Also, since this is initialization, why not do it in the ctor?

Contributor Author

Refactored the initialization logic
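
The refactored code is not shown in this thread. As a sketch of one common answer to the question above: a constructor cannot return an error status, so fallible setup is often moved into a static factory that returns either a fully initialized object or an error. Everything below (`SimpleResult`, `SketchWriter`, `Underlying`, `OpenUnderlying`) is a hypothetical stand-in, not the iceberg-cpp `Result` type or the `WriterFactoryRegistry` API.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical minimal result type: holds either a value or an error message.
template <typename T>
struct SimpleResult {
  std::optional<T> value;
  std::string error;
  bool ok() const { return value.has_value(); }
};

// Stand-in for the underlying format-specific writer.
struct Underlying {
  std::string format;
};

// Stand-in for opening a format-specific writer; fails for unknown formats.
inline SimpleResult<std::unique_ptr<Underlying>> OpenUnderlying(
    const std::string& format) {
  if (format != "parquet" && format != "avro") {
    return {std::nullopt, "unsupported format: " + format};
  }
  auto opened = std::make_unique<Underlying>();
  opened->format = format;
  return {std::move(opened), ""};
}

class SketchWriter {
 public:
  // All fallible work happens here, so any SketchWriter that exists is usable.
  static SimpleResult<std::unique_ptr<SketchWriter>> Make(const std::string& format);

  const std::string& format() const { return underlying_->format; }

 private:
  // Private: the only way to obtain an instance is through Make().
  explicit SketchWriter(std::unique_ptr<Underlying> underlying)
      : underlying_(std::move(underlying)) {
    assert(underlying_ != nullptr);
  }
  std::unique_ptr<Underlying> underlying_;
};

inline SimpleResult<std::unique_ptr<SketchWriter>> SketchWriter::Make(
    const std::string& format) {
  auto opened = OpenUnderlying(format);
  if (!opened.ok()) {
    return {std::nullopt, opened.error};  // propagate the error, build nothing
  }
  std::unique_ptr<SketchWriter> writer(new SketchWriter(std::move(*opened.value)));
  return {std::move(writer), ""};
}
```

With this shape there is no separate `Initialize()` step that callers can forget, and no half-constructed writer to guard against in every method.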

Comment on lines 62 to 58
if (closed_) {
  return InvalidArgument("Writer already closed");
}

Contributor

I could see a case for making Close idempotent. Is there a strong reason to return this error instead of, for example, treating the call as a no-op?

Contributor Author

Fixed
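
The fix itself is not shown here; the updated commit message describes it as a no-op after the first call. A minimal sketch of that behavior follows, using a hypothetical `SketchWriter` and a plain `bool` in place of the project's status type. Rejecting `Write()` after `Close()` is an assumption of this sketch, not something the thread confirms.

```cpp
#include <cassert>

class SketchWriter {
 public:
  // Returns true on success. A repeated call succeeds without repeating work.
  bool Close() {
    if (closed_) {
      return true;  // already closed: no-op instead of an error
    }
    ++flush_count_;  // stand-in for closing the underlying file writer
    closed_ = true;  // set only after the underlying close has succeeded
    assert(flush_count_ == 1);
    return true;
  }

  // Assumption: writing after Close() is still an error.
  bool Write(int /*row*/) {
    if (closed_) {
      return false;
    }
    ++rows_;
    return true;
  }

  bool closed() const { return closed_; }
  int flush_count() const { return flush_count_; }
  int rows() const { return rows_; }

 private:
  bool closed_ = false;
  int flush_count_ = 0;
  int rows_ = 0;
};
```

An idempotent `Close()` makes cleanup paths simpler, since error handlers and destructors can call it without first checking whether an earlier code path already did.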

  return InvalidArgument("Writer already closed");
}
ICEBERG_RETURN_UNEXPECTED(writer_->Close());
closed_ = true;

Contributor

Should this class address thread safety?

Contributor Author

Good question! I've added explicit documentation that this class is not thread-safe.
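
The added documentation is not included in this capture. When a class is documented as not thread-safe, the usual contract is that callers serialize access themselves. A sketch of that contract, assuming a hypothetical `SketchWriter` and a caller-owned `std::mutex`:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical writer. Not thread-safe: unsynchronized concurrent Write()
// calls would race on rows_.
class SketchWriter {
 public:
  void Write(int64_t n) { rows_ += n; }
  int64_t rows() const { return rows_; }

 private:
  int64_t rows_ = 0;
};

// Caller-side synchronization: every access to the writer happens while
// holding the same mutex, so the writer itself needs no locking.
inline int64_t WriteFromThreads(int num_threads, int writes_per_thread) {
  assert(num_threads >= 0 && writes_per_thread >= 0);
  SketchWriter writer;
  std::mutex mu;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&writer, &mu, writes_per_thread] {
      for (int i = 0; i < writes_per_thread; ++i) {
        std::lock_guard<std::mutex> lock(mu);
        writer.Write(1);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  return writer.rows();  // safe to read: all threads have been joined
}
```

Keeping the lock outside the class avoids paying for synchronization in the common single-threaded case, which is a frequent reason for documenting a writer as not thread-safe rather than adding a mutex to it.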

Comment on lines 78 to 109
TEST_F(DataWriterTest, CreateWithParquetFormat) {
  DataWriterOptions options{
      .path = "test_data.parquet",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kParquet,
      .io = file_io_,
      .properties = {{"write.parquet.compression-codec", "uncompressed"}},
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

TEST_F(DataWriterTest, CreateWithAvroFormat) {
  DataWriterOptions options{
      .path = "test_data.avro",
      .schema = schema_,
      .spec = partition_spec_,
      .partition = PartitionValues{},
      .format = FileFormatType::kAvro,
      .io = file_io_,
  };

  auto writer_result = DataWriter::Make(options);
  ASSERT_THAT(writer_result, IsOk());
  auto writer = std::move(writer_result.value());
  ASSERT_NE(writer, nullptr);
}

Contributor

nit: The two tests are quite similar; a shared helper function could probably reduce the duplication.

Contributor Author

Consolidated the two tests using parameterized testing.
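
The consolidated test is not shown in this capture. In GoogleTest the usual tools for this are `TEST_P` together with `INSTANTIATE_TEST_SUITE_P`; the sketch below shows the same idea without a test framework so that it stays self-contained. `Format`, `Case`, `CreateWriter`, and `RunCreateTests` are hypothetical stand-ins for the PR's `FileFormatType`, `DataWriterOptions`, and `DataWriter::Make`.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Format { kParquet, kAvro };

// One row per test case: the per-format differences live in data,
// not in copied test bodies.
struct Case {
  std::string name;
  Format format;
  std::string path;
};

inline std::vector<Case> FormatCases() {
  return {
      {"Parquet", Format::kParquet, "test_data.parquet"},
      {"Avro", Format::kAvro, "test_data.avro"},
  };
}

// Stand-in for the code under test: succeeds when the path extension
// matches the requested format.
inline bool CreateWriter(Format format, const std::string& path) {
  const std::string ext = (format == Format::kParquet) ? ".parquet" : ".avro";
  return path.size() >= ext.size() &&
         path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
}

// One shared body runs for every case and returns the names of failing cases.
inline std::vector<std::string> RunCreateTests(const std::vector<Case>& cases) {
  std::vector<std::string> failures;
  for (const Case& c : cases) {
    assert(!c.name.empty());  // every case needs a name for reporting
    if (!CreateWriter(c.format, c.path)) {
      failures.push_back(c.name);
    }
  }
  return failures;
}
```

Adding another format then means adding one row to `FormatCases()` rather than copying a whole test.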

// Check length before close
auto length_result = writer->Length();
ASSERT_THAT(length_result, IsOk());
EXPECT_GT(length_result.value(), 0);

Contributor

nit: check the size of the data passed to the write function?

Contributor Author
@shangxinli, Feb 7, 2026

Sure.

Comment on lines 45 to 47
if (!writer_) {
  return InvalidArgument("Writer not initialized");
}

Collaborator

Suggested change
- if (!writer_) {
-   return InvalidArgument("Writer not initialized");
- }
+ ICEBERG_PRECHECK(writer_, "Writer not initialized");

nit: this should make the code shorter.

Contributor Author

Replaced all manual null checks with ICEBERG_PRECHECK
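
The definition of `ICEBERG_PRECHECK` is not shown in this thread. As a rough illustration of why such a macro shortens call sites, here is a hypothetical `SKETCH_PRECHECK` that returns early with an error when a condition fails. The `Status` struct and `Writer` type are stand-ins, and the real macro may differ in both signature and error type.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical status type: ok defaults to true, message is empty on success.
struct Status {
  bool ok = true;
  std::string message;
};

// Hypothetical macro: return an error Status from the enclosing function when
// `cond` is false. The do/while(false) wrapper makes it act like one statement.
#define SKETCH_PRECHECK(cond, msg)   \
  do {                               \
    if (!(cond)) {                   \
      return Status{false, (msg)};   \
    }                                \
  } while (false)

struct Writer {
  int rows = 0;
};

// Before: manual null check, three lines per precondition.
inline Status WriteManual(const std::unique_ptr<Writer>& writer) {
  if (!writer) {
    return Status{false, "Writer not initialized"};
  }
  ++writer->rows;
  return Status{};
}

// After: the same behavior with the precondition on one line.
inline Status WriteChecked(const std::unique_ptr<Writer>& writer) {
  SKETCH_PRECHECK(writer, "Writer not initialized");
  assert(writer != nullptr);  // guaranteed by the precheck above
  ++writer->rows;
  return Status{};
}
```

Because the macro expands to a `return`, it only works inside functions whose return type can be built from the error value, which is why such helpers are usually paired with a project-wide status or result type.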

}

Result<FileWriter::WriteResult> Metadata() {
  if (!closed_) {

Collaborator

nit: use ICEBERG_CHECK here

Contributor Author

SG

EXPECT_GT(length.value(), 0);
}

} // namespace

Collaborator

nit: move this closing namespace curly before the first TEST_F?

Contributor Author

done

@shangxinli shangxinli force-pushed the implement-data-file-writer branch 2 times, most recently from 90d324e to 153d763 Compare February 7, 2026 01:31
Implements DataWriter class for writing Iceberg data files as part of
issue apache#441 (task 2).

Implementation:
- Static factory method DataWriter::Make() for creating writer instances
- Support for Parquet and Avro file formats via WriterFactoryRegistry
- Complete DataFile metadata generation including partition info,
  column statistics, serialized bounds, and sort order ID
- Proper lifecycle management with Write/Close/Metadata methods
- Idempotent Close() - multiple calls succeed (no-op after first)
- PIMPL idiom for ABI stability
- Not thread-safe (documented)

Tests:
- 13 comprehensive unit tests including parameterized format tests
- Coverage: creation, write/close lifecycle, metadata generation,
  error handling, feature validation, and data size verification
- All tests passing (13/13)

Related to apache#441
@shangxinli shangxinli force-pushed the implement-data-file-writer branch from 153d763 to 147f25b Compare February 7, 2026 01:34

3 participants