Conversation
Register datafusion-spark's SparkILike UDF and add CometILike handler in stringExpressions. Custom escape characters fall back to Spark.
andygrove
left a comment
There was a problem hiding this comment.
Thanks for working on ILIKE support. The escape character check and the general approach look reasonable.
I have a question about whether CometILike actually gets matched during serde. In Spark, ILike is a RuntimeReplaceable expression. Its replacement is Like(Lower(left), Lower(right)). This means Spark's analyzer should replace ILike with Like(Lower(left), Lower(right)) before Comet sees the physical plan. If that's the case, the CometILike handler would be dead code, because the plan would already contain Like and Lower nodes instead of ILike.
The fallback test seems to confirm this. The expected fallback reason is "Comet is not compatible with Spark for case conversion", which comes from CometCaseConversionBase (used by CometLower), not from CometILike. So it looks like the expression is flowing through the existing CometLike + CometLower path. When spark.comet.caseConversion.enabled is true, Lower serializes successfully and Like handles the matching. When it's false, CometLower refuses to serialize and everything falls back.
If that's the case, the CometILike serde handler, the SparkILike UDF registration in jni_api.rs, and the ILike import in strings.scala may all be unnecessary. Could you check whether removing those pieces still makes the tests pass? It's possible that ILIKE already works today through the Like(Lower(...), Lower(...)) replacement path, gated by spark.comet.caseConversion.enabled.
If it turns out ILike does appear in the plan for some reason, then CometILike.convert() should probably check COMET_CASE_CONVERSION_ENABLED the same way CometCaseConversionBase does, since Arrow's ilike kernel uses Rust's to_lowercase() which has the same locale mismatch.
Summary
ilikefrom thedatafusion-sparkcrate (SparkILike) to Cometjni_api.rsand add serde mapping inQueryPlanSerde.scalaCometILikeinstrings.scalato reject non-default escape charactersspark.comet.caseConversion.enabledis disabled (default), ILIKE falls back to Spark to avoid incompatibilities with Rust's Unicode-basedto_lowercase()vs Java's locale-aware rules (e.g. Turkish I)Please let me know if the
caseConversionhandling is not appropriate