Deduplication

  • The RAG Service indexer may be configured to automatically detect duplicate articles and text chunks in the dataset. These duplicates are excluded during retrieval.
  • The API response can include references to such duplicates if requested.
  • When there are full-article duplicates, they are returned under articleDuplicates.
  • When only individual text chunks are duplicated, rather than whole articles, they are returned under chunkDuplicates, which in turn refer to paragraphDuplicates so that they can be matched to their source articles.
  • Full-article duplicates and text chunk duplicates never overlap. An article can be either a full duplicate or a partial duplicate, but not both at the same time.
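
As a minimal sketch of how a client might read these fields: the code below assumes the response is a JSON object in which articleDuplicates, chunkDuplicates, and paragraphDuplicates appear as top-level lists, and that they may be absent when duplicates were not requested. The nesting, the helper name extract_duplicates, and the sample values are assumptions made for illustration, not the documented schema.

```python
from typing import Any


def extract_duplicates(response: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Collect duplicate references from a response, tolerating absent keys.

    Assumption: the three lists sit at the top level of the response object.
    """
    keys = ("articleDuplicates", "chunkDuplicates", "paragraphDuplicates")
    # A missing or null key is treated as "no duplicates of that kind".
    return {key: list(response.get(key) or []) for key in keys}


# Hypothetical response containing only full-article duplicates.
sample_response = {
    "articleDuplicates": [{"leaderId": "a1", "duplicateId": "a7"}],
}

found = extract_duplicates(sample_response)
```

Defaulting to empty lists lets downstream code iterate over all three kinds of duplicates without checking whether each one was included in the response.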

Duplicate articles and text chunks are represented as objects similar to the main objects, but instead of an id field they have two fields: leaderId and duplicateId.

  • leaderId refers to the leader of a group of duplicates (items that are all duplicates of one another). The leader is the item that was actually used when generating the answer and is referenced in the main part of the API response.
  • duplicateId is the ID of the duplicate item, which is normally hidden. There can be several objects in the response referring to the same leaderId, but each duplicateId is unique.
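
The relationship between the two fields can be illustrated by grouping duplicate objects under their leader. This is a sketch only: the function name group_by_leader and the sample IDs are hypothetical, and each duplicate object is assumed to be a dictionary carrying at least leaderId and duplicateId.

```python
from collections import defaultdict


def group_by_leader(duplicates: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each leaderId to the duplicateIds that point at it.

    Raises ValueError if a duplicateId appears more than once, since each
    duplicateId is expected to be unique within a response.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    seen: set[str] = set()
    for item in duplicates:
        duplicate_id = item["duplicateId"]
        if duplicate_id in seen:
            raise ValueError(f"duplicateId {duplicate_id!r} appears more than once")
        seen.add(duplicate_id)
        groups[item["leaderId"]].append(duplicate_id)
    return dict(groups)


# Hypothetical chunk duplicates: two hidden chunks share leader "c1".
sample_duplicates = [
    {"leaderId": "c1", "duplicateId": "c4"},
    {"leaderId": "c1", "duplicateId": "c9"},
    {"leaderId": "c2", "duplicateId": "c5"},
]

groups = group_by_leader(sample_duplicates)
```

Because several objects may share a leaderId, the result maps each leader to a list; the uniqueness check mirrors the rule that every duplicateId occurs only once.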

See: ChunkDuplicate, ArticleDuplicate.