Skip to content

ATLAS-5032: Fix basic search when querying by long attribute values#650

Open
saksenasonali wants to merge 2 commits into
apache:masterfrom
saksenasonali:ATLAS-5032-fix-long-attribute-search
Open

ATLAS-5032: Fix basic search when querying by long attribute values#650
saksenasonali wants to merge 2 commits into
apache:masterfrom
saksenasonali:ATLAS-5032-fix-long-attribute-search

Conversation

@saksenasonali
Copy link
Copy Markdown

What changes were proposed in this pull request?

ATLAS-5032: Fix basic search for long qualifiedName with startsWith / endsWith / contains

Problem
Basic search with attribute filters on qualifiedName returns no results when filter values exceed Solr’s default max token length (255). This affects startsWith, endsWith, and contains, especially when multiple criteria on the same attribute are combined with AND (e.g. qualifiedName starts with a long prefix and ends with @primary).

Root cause: Solr ignores tokens longer than maxTokenLength, so index-based search does not match even though the entity exists and can be retrieved by GUID.

Solution (Approach 2 from ATLAS-5032)
For indexed string attributes, when the filter value length exceeds the configured Solr token limit, do not use the Solr index for STARTS_WITH, ENDS_WITH, or CONTAINS. Search falls back to JanusGraph instead.

Also ensure index and graph query paths stay consistent when the same attribute appears in multiple AND criteria:

Skip graph filter construction when the criterion is still index-searchable.
Skip index query construction when the criterion is not index-searchable.

How was this patch tested?
Unit / module tests
EntitySearchProcessorTest — 48 tests, including 6 new ATLAS-5032 scenarios (short and long qualifiedName, hive_table and hive_column, tokenized name, CONTAINS + ENDS_WITH).
Full repository module: mvn -pl repository test — 2391 tests, 0 failures.
mvn -pl common,repository -DskipTests install — build success.
Manual / REST (local Docker Atlas)
Reproduced the JIRA flow against http://localhost:21000:

Created a hive_table with a ~370-character name and qualifiedName default.@primary.
Basic search with AND:
qualifiedName startsWith default.<370-char-name>
qualifiedName endsWith @primary
Result: 1 matching entity (previously empty).

@@ -866,6 +871,10 @@ private boolean isIndexSearchable(FilterCriteria filterCriteria, AtlasStructType
} else if (operator == SearchParameters.Operator.CONTAINS && AtlasAttribute.hastokenizeChar(attributeValue) && indexType == null) { // indexType = TEXT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this patch updating isIndexSearchable method, it's good to reorder indexType before hastokenizeChar: Check indexType == null first so STRING-indexed attributes skip the full-string tokenize scan. Small performance win, no behavior change.

LOG.debug("{} operator found for string (TEXT) attribute {} and special characters found in filter value {}, deferring to in-memory or graph query (might cause poor performance)", operator, qualifiedName, attributeValue);

ret = false;
} else if ((operator == SearchParameters.Operator.STARTS_WITH || operator == SearchParameters.Operator.ENDS_WITH || operator == SearchParameters.Operator.CONTAINS) && attributeValue.length() > SOLR_MAX_TOKEN_STR_LENGTH) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optional, Single CONTAINS block: Tokenize-char and max-length checks for CONTAINS are correct but split across two conditions. Merging them under one CONTAINS condition would improve readability.

LOG.debug("{} operator found for string (TEXT) attribute {} and special characters found in filter value {}, deferring to in-memory or graph query (might cause poor performance)", operator, qualifiedName, attributeValue);

ret = false;
} else if ((operator == SearchParameters.Operator.STARTS_WITH || operator == SearchParameters.Operator.ENDS_WITH || operator == SearchParameters.Operator.CONTAINS) && attributeValue.length() > SOLR_MAX_TOKEN_STR_LENGTH) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null safety (minor): The length check uses attributeValue.length() directly; a null-safe length check or using StringUtils.length would avoid a potential NPE.

public static final String INDEX_SEARCH_MAX_RESULT_SET_SIZE = "atlas.graph.index.search.max-result-set-size";
public static final String INDEX_SEARCH_TYPES_MAX_QUERY_STR_LENGTH = "atlas.graph.index.search.types.max-query-str-length";
public static final String INDEX_SEARCH_TAGS_MAX_QUERY_STR_LENGTH = "atlas.graph.index.search.tags.max-query-str-length";
public static final String INDEX_SEARCH_SOLR_MAX_TOKEN_LENGTH = "atlas.graph.solr.index.search.max-token-length";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid Solr-specific naming in INDEX_SEARCH_SOLR_MAX_TOKEN_LENGTH: Atlas can use different indexing backends, so embedding "Solr" in the constant (and related names like SOLR_MAX_TOKEN_STR_LENGTH and the config key atlas.graph.solr.index.search.max-token-length) ties the API to one implementation. Prefer a backend-neutral name such as INDEX_SEARCH_MAX_TOKEN_LENGTH / INDEX_SEARCH_MAX_TOKEN_STR_LENGTH, with a generic property key (e.g. atlas.graph.index.search.max-token-length) unless this limit is truly Solr-only.

private static final String LONG_TABLE_NAME =
"rltdvhrxhocivajyqojaxulwzhgzgzpkfolziacfnndzncjkthzeaeykdhhrjqdeibhdgiepwqkinvqzqxevushydtwjaabzgbfjmzvcbsoxewruhyhciyjefzsnokxvbeiiowzbhlkmujcnwilslgeswobzwwvkugyupsemqxsbdcmgrlpxmeuljvxyddvpccvcloupjorziwhogwnjvsdrwksvrbxcxjlcrcmrvvmbdmenafmvgrqzcaqbgpnhxiqbvxcbnudafsmjzvlzzzzpqmjkngbximmbjbijrqfb";

private static final String LONG_TOKENIZED_TABLE_NAME =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update LONG_TOKENIZED_TABLE_NAME to include tokenize characters: The value is currently all alphanumeric, so it does not satisfy hastokenizeChar and the name is misleading. Change the string to include a few special characters (e.g. ., :, or @) so it actually exercises the TEXT-index tokenize-char deferral path, not just the max token length boundary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants