Add BaseFormatModelTest for FormatModel implementations #15441

joyhaldar wants to merge 2 commits into apache:main from
Conversation
Force-pushed from 9abf7c7 to 1e3e8a7
```java
protected abstract Class<W> writeType();

protected abstract Class<R> readType();
```
Why do we have different read and write types?
I would expect that we use generic Records for one, and the model-specific type for the other.
```java
EqualityDeleteWriter<W> writer =
    writerBuilder
        .schema(TestBase.SCHEMA)
        .engineSchema(writeEngineSchema(TestBase.SCHEMA))
```
Why is this added?
I remember similar issues when I was working on the Spark model, but I also remember fixing it.
Do we need this at this point?
Tests fail for AVRO without `engineSchema` with the error `java.lang.IllegalArgumentException: Invalid struct: null is not a struct`.
When I checked the code:
- For AVRO, `engineSchema` is passed directly to `SparkAvroWriter` with no null fallback (`SparkFormatModels.java` line 43)
- For Parquet, `SparkParquetWriters.buildWriter` has a fallback: if `engineSchema` is null, it converts from `icebergSchema` (`SparkParquetWriters.java` line 89)

This is according to my understanding; please correct me if I am incorrect. Should I keep `engineSchema` in the tests, or should AVRO have a similar fallback?
We should create a similar fallback for Avro in an independent PR.
This is why these tests are good!
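To make the fallback idea concrete, here is a minimal, illustrative-only sketch of the pattern described above: use the explicit engine schema when the caller supplies one, otherwise derive it from the Iceberg schema, mirroring what `SparkParquetWriters.buildWriter` reportedly does for Parquet. None of the names below are Iceberg's actual API; the schema types are stand-ins.

```java
import java.util.Objects;
import java.util.function.Function;

public class EngineSchemaFallback {
  // Resolve the engine schema: prefer the explicit one, otherwise convert
  // from the Iceberg schema. I and E are stand-ins for the real schema types.
  static <I, E> E resolve(E engineSchema, I icebergSchema, Function<I, E> convert) {
    return engineSchema != null
        ? engineSchema
        : convert.apply(Objects.requireNonNull(icebergSchema, "icebergSchema"));
  }

  public static void main(String[] args) {
    // Toy stand-ins: the "Iceberg schema" is a String, the "engine schema"
    // is its upper-case form produced by the converter.
    String derived = resolve(null, "struct<id:int>", String::toUpperCase);
    String explicit = resolve("EXPLICIT", "struct<id:int>", String::toUpperCase);
    System.out.println(derived);
    System.out.println(explicit);
  }
}
```

With this shape, the Avro write path would no longer throw when `engineSchema` is absent, and the tests could drop the explicit `engineSchema(...)` call.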
```java
protected abstract Object readEngineSchema(Schema schema);

protected abstract List<W> testRecords();
```
I would leave the responsibility of generating the test records to the base class. We might want to add different test data later (like an array of maps, or a struct of arrays), and I would love to see those tests run automatically.
Maybe something like the Flink DataGenerators could help here. Or a simple converter method?
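A hedged sketch of the "converter method" idea: the base class owns the (deterministic) test data, and each engine subclass contributes only a conversion to its own row type, so new test data added to the base class reaches every engine automatically. All names here are illustrative, not from the PR; a toy String "engine" stands in for Spark's `InternalRow` or Flink's `RowData`.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ConverterSketch {
  abstract static class BaseModelTest<W> {
    // Stand-in for something like RandomGenericData.generate(schema, count, seed);
    // deterministic so every engine is exercised with the same data.
    protected List<String> genericRecords() {
      return List.of("record-0", "record-1", "record-2");
    }

    // The only engine-specific piece: a converter, not the data itself.
    protected abstract W toEngineRow(String generic);

    final List<W> testRecords() {
      return genericRecords().stream().map(this::toEngineRow).collect(Collectors.toList());
    }
  }

  // A toy "engine" whose row type is an upper-cased String.
  static class UpperCaseModelTest extends BaseModelTest<String> {
    @Override
    protected String toEngineRow(String generic) {
      return generic.toUpperCase();
    }
  }

  public static void main(String[] args) {
    System.out.println(new UpperCaseModelTest().testRecords());
  }
}
```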
```java
InputFile inputFile = encryptedFile.encryptingOutputFile().toInputFile();
List<R> readRecords;
try (CloseableIterable<R> reader =
    FormatModelRegistry.readBuilder(fileFormat, readType(), inputFile)
```
We don't need an engine-specific reader for the positional deletes. We can just read with the generic reader.
```java
protected abstract void assertEquals(Types.StructType struct, List<W> expected, List<R> actual);

protected abstract List<W> expectedPositionDeletes(Schema schema);
```
We don't need to test engine-specific readers for position deletes. It is enough if the engine-specific write is tested with the generic read.
…ations with Generic, Spark, and Flink tests

Force-pushed from 1e3e8a7 to 42ab761
rambleraptor left a comment:
Loving the direction this is going!
```java
@ParameterizedTest
@FieldSource("FILE_FORMATS")
public void testDataWriterRoundTrip(FileFormat fileFormat) throws IOException {
```
What would you think about creating a roundTrip method (or possibly several depending on the types)? Most of these roundTrip methods are trying to do the same things.
My gut feeling is that we'd use the roundTrip methods on a lot of different tests.
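One possible shape for that shared helper, sketched with purely illustrative names (nothing here is Iceberg's API): the writer and reader become parameters, so the same `roundTrip` method can serve data files, equality deletes, and position deletes, and each test reduces to "write these records, read them back, compare".

```java
import java.util.List;
import java.util.function.Function;

public class RoundTripSketch {
  // Generic round trip: serialize with the given writer, then deserialize
  // with the given reader. W is the write-side row type, R the read-side.
  static <W, R> List<R> roundTrip(
      List<W> records, Function<List<W>, byte[]> write, Function<byte[], List<R>> read) {
    return read.apply(write.apply(records));
  }

  public static void main(String[] args) {
    // Toy "file format": join with commas on write, split on read.
    List<String> back = roundTrip(
        List.of("a", "b", "c"),
        rs -> String.join(",", rs).getBytes(),
        bytes -> List.of(new String(bytes).split(",")));
    System.out.println(back);
  }
}
```

In the real tests the two functions would presumably wrap the engine-specific writer builder and the generic `FormatModelRegistry` reader.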
```java
public class TestGenericFormatModels {
  private static final List<Record> TEST_RECORDS =
      RandomGenericData.generate(TestBase.SCHEMA, 10, 1L);

public abstract class TestBaseFormatModel<T> {
```
Make sure that the visibility modifiers are as strict as possible for classes, methods, and attributes.
Adds a base test class and tests for FormatModel implementations.

Changes
- `BaseFormatModelTest<T>` - base test class supporting the engine type `T`
- `TestSparkFormatModel` - Spark `InternalRow` round-trip tests
- `TestFlinkFormatModel` - Flink `RowData` round-trip tests