Add BaseFormatModelTest for FormatModel implementations #15441

Draft
joyhaldar wants to merge 2 commits into apache:main from joyhaldar:file-format-api-tck

joyhaldar (Contributor) commented Feb 25, 2026

Adds base test class and tests for FormatModel implementations.

Changes

  • BaseFormatModelTest<T> - Base test class parameterized by the engine type (T)
  • TestSparkFormatModel - Spark InternalRow round-trip tests
  • TestFlinkFormatModel - Flink RowData round-trip tests

Comment on lines 56 to 58
protected abstract Class<W> writeType();

protected abstract Class<R> readType();
Contributor

Why do we have different read and write types?

I would expect that we use generic Records for one, and the model-specific type for the other.
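
A minimal sketch of the shape this suggestion points at, assuming a single engine type parameter `T` with generic records on the other side of each round trip. Nothing here is the Iceberg API: `SingleEngineTypeSketch`, `Base`, `ArrayEngine`, and the use of `List<Object>` as a stand-in for a generic Record are hypothetical, and the conversions merely stand in for the real file write and read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SingleEngineTypeSketch {
  private SingleEngineTypeSketch() {}

  // Hypothetical base test with one engine type T.
  // List<Object> stands in for a generic Record (assumption, not the Iceberg API).
  abstract static class Base<T> {
    // Would replace the separate writeType() and readType() hooks.
    abstract Class<T> engineType();

    abstract T toEngine(List<Object> genericRow);

    abstract List<Object> toGeneric(T engineRow);

    // Direction 1: engine-specific write, generic read.
    final List<List<Object>> writeEngineReadGeneric(List<T> engineRows) {
      List<List<Object>> read = new ArrayList<>();
      for (T row : engineRows) {
        read.add(toGeneric(row));
      }
      return read;
    }

    // Direction 2: generic write, engine-specific read.
    final List<T> writeGenericReadEngine(List<List<Object>> genericRows) {
      List<T> read = new ArrayList<>();
      for (List<Object> row : genericRows) {
        read.add(toEngine(row));
      }
      return read;
    }
  }

  // Toy "engine" whose row type is Object[]; purely illustrative.
  static final class ArrayEngine extends Base<Object[]> {
    @Override
    Class<Object[]> engineType() {
      return Object[].class;
    }

    @Override
    Object[] toEngine(List<Object> genericRow) {
      return genericRow.toArray();
    }

    @Override
    List<Object> toGeneric(Object[] engineRow) {
      return new ArrayList<>(Arrays.asList(engineRow));
    }
  }
}
```

With this shape, every test pairs the engine type with generic records, so one type parameter is enough.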

EqualityDeleteWriter<W> writer =
    writerBuilder
        .schema(TestBase.SCHEMA)
        .engineSchema(writeEngineSchema(TestBase.SCHEMA))
Contributor

Why is this added?
I remember similar issues when I was working on the Spark model, but I also remember fixing them.
Do we need this at this point?

Contributor Author

Tests fail for AVRO without engineSchema with the error java.lang.IllegalArgumentException: Invalid struct: null is not a struct.

When I checked the code:

This is my understanding; please correct me if I am wrong. Should I keep engineSchema in the tests, or should AVRO have a similar fallback?

Contributor

We should create a similar fallback for Avro in an independent PR.
This is why these tests are good!
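
For illustration only, a fallback of this kind could be sketched as below. This is an assumption about the general idea (use the supplied engine schema if present, otherwise derive one from the table schema), not the code in the other format models or a proposed Avro change; `EngineSchemaFallbackSketch`, `resolveEngineSchema`, and the `deriveFromTableSchema` function are hypothetical names.

```java
import java.util.function.Function;

final class EngineSchemaFallbackSketch {
  private EngineSchemaFallbackSketch() {}

  // Returns the caller-supplied engine schema, or derives one from the
  // table schema when none was supplied (hypothetical helper).
  static <S, E> E resolveEngineSchema(
      E engineSchema, S tableSchema, Function<S, E> deriveFromTableSchema) {
    return engineSchema != null ? engineSchema : deriveFromTableSchema.apply(tableSchema);
  }
}
```

With a fallback like this in place, the tests would no longer need to pass engineSchema explicitly.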


protected abstract Object readEngineSchema(Schema schema);

protected abstract List<W> testRecords();
Contributor

I would leave the responsibility of generating the test records in the base class. We might want to add different test data later (like an array of maps, or a struct of arrays), and I would love for the tests to pick it up automatically.

Maybe something like the Flink DataGenerators could help here. Or a simple converter method?
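
A rough sketch of the "simple converter method" option, under the assumption that the base class owns the data and each engine test supplies only a conversion. `BaseGeneratedRecordsSketch`, `Base`, `convert`, and `ArrayEngine` are hypothetical names, and `List<Object>` stands in for a generic Record; the real base class would presumably use something like RandomGenericData instead of the toy generator here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class BaseGeneratedRecordsSketch {
  private BaseGeneratedRecordsSketch() {}

  // Hypothetical base class; List<Object> stands in for a generic Record.
  abstract static class Base<T> {
    // The only engine-specific hook: convert one generic row to the engine type.
    abstract T convert(List<Object> genericRow);

    // Test data lives in the base class, so new data shapes reach every engine test.
    final List<List<Object>> genericRecords(int count, long seed) {
      Random random = new Random(seed);
      List<List<Object>> rows = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        rows.add(List.of(i, "data-" + random.nextInt(1000)));
      }
      return rows;
    }

    // Engine records are always derived from the shared generic records.
    final List<T> engineRecords(int count, long seed) {
      List<T> converted = new ArrayList<>();
      for (List<Object> row : genericRecords(count, seed)) {
        converted.add(convert(row));
      }
      return converted;
    }
  }

  // Toy "engine" whose row type is Object[]; purely illustrative.
  static final class ArrayEngine extends Base<Object[]> {
    @Override
    Object[] convert(List<Object> genericRow) {
      return genericRow.toArray();
    }
  }
}
```

Adding a new data shape then means changing `genericRecords` once; each engine test only has to keep its `convert` method complete.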

InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile();
List<R> readRecords;
try (CloseableIterable<R> reader =
    FormatModelRegistry.readBuilder(fileFormat, readType(), inputFile)
Contributor

We don't need an engine-specific reader for the position deletes. We can just read with the generic reader.


protected abstract void assertEquals(Types.StructType struct, List<W> expected, List<R> actual);

protected abstract List<W> expectedPositionDeletes(Schema schema);
Contributor

We don't need to test engine-specific readers for position deletes. It is enough if the engine-specific write is tested with the generic read.

rambleraptor (Contributor) left a comment

Loving the direction this is going!


@ParameterizedTest
@FieldSource("FILE_FORMATS")
public void testDataWriterRoundTrip(FileFormat fileFormat) throws IOException {
Contributor

What would you think about creating a roundTrip method (or possibly several depending on the types)? Most of these roundTrip methods are trying to do the same things.

My gut feeling is that we'd use the roundTrip methods on a lot of different tests.
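
One possible shape for such a helper, sketched without the Iceberg API: the `writer` and `reader` functions and the in-memory list are hypothetical stand-ins for the real writer builder, the written file, and the read builder, and `RoundTripSketch` is an illustrative name only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class RoundTripSketch {
  private RoundTripSketch() {}

  // Pushes every record through `writer` into an in-memory stand-in for a
  // file, then reads each stored entry back through `reader`.
  static <W, F, R> List<R> roundTrip(
      List<W> records, Function<W, F> writer, Function<F, R> reader) {
    List<F> file = new ArrayList<>();
    for (W record : records) {
      file.add(writer.apply(record));
    }

    List<R> result = new ArrayList<>();
    for (F stored : file) {
      result.add(reader.apply(stored));
    }
    return result;
  }
}
```

A test body would then reduce to building the input, calling the helper, and asserting on the result, for example `RoundTripSketch.<Integer, String, Integer>roundTrip(List.of(1, 2, 3), i -> Integer.toString(i), s -> Integer.parseInt(s))`.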

public class TestGenericFormatModels {
  private static final List<Record> TEST_RECORDS =
      RandomGenericData.generate(TestBase.SCHEMA, 10, 1L);
public abstract class TestBaseFormatModel<T> {
Contributor

Make sure that the visibility modifiers are as strict as possible for classes, methods, and attributes.


3 participants