Unable to Query the Required Trace

Problem Description

When querying traces in a service mesh, the trace you are looking for may not appear in the results.

Root Cause Analysis

1. Trace Sampling Rate Configured Too Low

When the trace sampling rate is set too low, the system collects traces for only a proportional fraction of requests. When request volume is low, for example during off-peak periods, few or no traces may be collected, so the request you are looking for may have no trace at all.
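To get a feel for the effect, the arithmetic can be sketched as follows. This is a minimal illustration that assumes each request is sampled independently with a fixed probability; the function names and the example numbers are hypothetical and not taken from the product.

```python
def expected_sampled(requests: int, sampling_rate: float) -> float:
    """Expected number of traces collected out of `requests` requests."""
    return requests * sampling_rate


def prob_no_trace(requests: int, sampling_rate: float) -> float:
    """Probability that not a single one of `requests` requests is sampled."""
    return (1.0 - sampling_rate) ** requests


if __name__ == "__main__":
    # Hypothetical quiet period: 50 requests at a 1% sampling rate.
    rate, n = 0.01, 50
    print(f"expected traces: {expected_sampled(n, rate):.2f}")
    print(f"P(no trace at all): {prob_no_trace(n, rate):.1%}")
```

With these example numbers, the expected count is half a trace, and more often than not the period produces no trace at all.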

2. Elasticsearch Real-Time Limitations

By default, the Elasticsearch index is configured with "refresh_interval": "10s", so newly written data can take up to 10 seconds to move from the in-memory buffer to a searchable state. When you query recently generated traces, results may be missing because the data has not yet been refreshed and is therefore not searchable.

This setting reduces the segment merge pressure on Elasticsearch and improves indexing speed and the speed of the first query, at the cost of some real-time visibility of the data.
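On the client side, one way to cope with this delay is to retry a lookup until at least one full refresh cycle has passed. The sketch below is illustrative only: `fetch` is a hypothetical callable that returns the trace or None, and no real query API is assumed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until_searchable(
    fetch: Callable[[], Optional[T]],
    refresh_interval_s: float = 10.0,
    poll_every_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `fetch` until it returns a result or a full refresh cycle has passed.

    `fetch` is a hypothetical lookup that returns None while the trace
    is not (yet) searchable.
    """
    waited = 0.0
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited >= refresh_interval_s + poll_every_s:
            # Still missing after a full refresh cycle: the cause is
            # probably something else (sampling, query conditions).
            return None
        sleep(poll_every_s)
        waited += poll_every_s
```

If the function returns None, the refresh delay is unlikely to be the cause, and the other root causes in this article are worth checking.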

3. Improper Query Condition Settings

If you use the Span kind parameter in a trace query without understanding how it works, the query may return no data, so avoid using this parameter arbitrarily. In particular, specifying both Client and Server together with a Span kind can produce empty query results.

Example 1: Span kind set to Root Span with both Client and Server specified

In this case, the query returns no data. When both the client and the server are instrumented by the OTel Agent, the root span of the trace is typically on the client side, so server data is not retrieved. To resolve this, remove the Server condition or do not select Root Span.

Example 2: Span kind set to Service Entry Span with both Client and Server specified

Similarly, this query returns no data. When both the client and the server have a Sidecar injected, the Service Entry Span refers to the first request received by the server, but the trace data is stored on the client side. To resolve this, remove the Client condition or do not select Service Entry Span.
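The two examples can be illustrated with a toy model. Everything here is an assumption made for illustration, not the product's actual query engine: the `Span` fields, the `trace_matches` function, and the rule that a selected Span kind forces all Client/Server conditions to hold on that one span.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    name: str
    side: str                       # "client" or "server" (toy field)
    is_root: bool = False
    is_service_entry: bool = False


# Maps a Span kind choice to the flag that selects spans of that kind.
KIND_FLAG = {"root": "is_root", "service_entry": "is_service_entry"}


def trace_matches(
    trace: Sequence[Span],
    span_kind: Optional[str] = None,
    sides: Sequence[str] = (),
) -> bool:
    """Toy matching rule (assumed, not the real implementation)."""
    if span_kind is None:
        # No Span kind: each condition may be met by a different span.
        return all(any(s.side == side for s in trace) for side in sides)
    # Span kind selected: one span of that kind must meet every condition.
    selected = [s for s in trace if getattr(s, KIND_FLAG[span_kind])]
    return any(all(s.side == side for side in sides) for s in selected)


# Hypothetical trace: root span on the client, entry span on the server.
trace = [
    Span("GET /checkout", "client", is_root=True),
    Span("GET /checkout", "server", is_service_entry=True),
]
print(trace_matches(trace, "root", ("client", "server")))           # Example 1
print(trace_matches(trace, "root", ("client",)))                    # Server removed
print(trace_matches(trace, "service_entry", ("client", "server")))  # Example 2
print(trace_matches(trace, "service_entry", ("server",)))           # Client removed
```

In this model a single span belongs to one side only, so a Span kind combined with both conditions can never match, while dropping the condition for the other side restores the result.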

Solution for Root Cause 1

  • Appropriately increase the sampling rate according to requirements.
  • Use richer sampling methods, such as tail sampling.
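As a rough illustration of the idea behind tail sampling, the decision is made after a trace has finished, so interesting traces (errors, slow requests) can always be kept while the rest are sampled at a low rate. The sketch below is a simplified model: the `FinishedTrace` type, the thresholds, and the policy itself are hypothetical and do not describe any specific collector's configuration.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: FinishedTrace,
    slow_threshold_ms: float = 500.0,   # hypothetical latency threshold
    baseline_rate: float = 0.01,        # hypothetical rate for ordinary traces
    rand: Callable[[], float] = random.random,
) -> bool:
    """Tail-sampling decision, made once the whole trace is available."""
    if trace.has_error:
        return True                     # always keep failed requests
    if trace.duration_ms >= slow_threshold_ms:
        return True                     # always keep slow requests
    return rand() < baseline_rate       # keep a small share of the rest
```

Compared with simply raising the sampling rate, a policy of this kind keeps the traces most likely to be searched for without storing every ordinary request.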

Solution for Root Cause 2

Adjust the refresh interval with the --es.asm.index-refresh-interval startup parameter of jaeger-collector. The default value is 10s.

If this parameter is set to "null", no refresh_interval is configured for the index, and Elasticsearch falls back to its own default (1s).

Note: Setting it to "null" makes new data searchable sooner, but the more frequent refreshes affect the performance and query speed of Elasticsearch.
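The relationship between the parameter value and the resulting index setting can be sketched as follows. This is an illustration of the behavior described above, not the collector's actual code: the helper names are hypothetical, and only the flag name and the "null" semantics come from this article.

```python
from typing import Dict

FLAG = "--es.asm.index-refresh-interval"


def collector_arg(value: str = "10s") -> str:
    """Startup argument to pass to jaeger-collector for a given interval."""
    return f"{FLAG}={value}"


def index_settings(value: str = "10s") -> Dict[str, str]:
    """Index settings fragment implied by the flag value (illustrative)."""
    if value == "null":
        # refresh_interval is left unset; Elasticsearch uses its own default.
        return {}
    return {"refresh_interval": value}


if __name__ == "__main__":
    for v in ("10s", "5s", "null"):
        print(collector_arg(v), "->", index_settings(v))
```

A shorter interval such as the hypothetical 5s above trades some indexing throughput for fresher query results.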

Solution for Root Cause 3

When using the Span kind parameter in trace queries:

  • Avoid using both Client and Server conditions simultaneously.
  • For Root Span queries:
    • Remove the Server condition if both client and server are present.
    • Or avoid selecting Root Span if you need server-side data.
  • For Service Entry Span queries:
    • Remove the Client condition if both client and server are present.
    • Or avoid selecting Service Entry Span if you need client-side data.