Deduplication
- The RAG Service indexer may be configured to automatically detect duplicate articles and text chunks in the dataset. These duplicates are excluded during retrieval.
- The API response can include references to such duplicates if requested.
- When there are full-article duplicates, they are returned under articleDuplicates.
- When there are only text chunk duplicates, they are returned under chunkDuplicates, which in turn refer to paragraphDuplicates so that they can be matched to the source articles.
- Full-article duplicates and text chunk duplicates never overlap. An article can be either a full duplicate or a partial duplicate, but not both at the same time.
Duplicate articles and text chunks are represented as objects similar to the main objects, but instead of an id field they have two fields: leaderId and duplicateId.
- leaderId refers to the leader item from a group of duplicates (items that are mutual duplicates of each other). It is the item that was actually used when generating the answer and is referenced in the main part of the API response.
- duplicateId is the ID of the duplicate item, which is normally hidden.
There can be several objects in the response referring to the same leaderId, but each duplicateId is unique.
See: ChunkDuplicate, ArticleDuplicate.
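The relationship between leaderId and duplicateId can be illustrated with a minimal Python sketch. It is not part of the RAG Service API: the function name group_duplicates_by_leader, the sample IDs, and the assumption that IDs are strings are all hypothetical. Only the leaderId and duplicateId field names come from the description above. The sketch groups duplicate objects by their leader and checks that each duplicateId occurs only once.

```python
from collections import defaultdict
from typing import Any


def group_duplicates_by_leader(duplicates: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each leaderId to the duplicateIds that were hidden in its favour.

    `duplicates` is assumed to be a list of already-parsed duplicate objects
    (for example, the contents of articleDuplicates), each carrying a
    leaderId and a duplicateId. IDs are assumed to be strings.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    seen: set[str] = set()
    for item in duplicates:
        leader_id = item["leaderId"]
        duplicate_id = item["duplicateId"]
        # Several objects may share a leaderId, but a duplicateId must not repeat.
        if duplicate_id in seen:
            raise ValueError(f"duplicateId {duplicate_id!r} appears more than once")
        seen.add(duplicate_id)
        groups[leader_id].append(duplicate_id)
    return dict(groups)


# Hypothetical sample data, not taken from a real response.
sample_article_duplicates = [
    {"leaderId": "a1", "duplicateId": "a7"},
    {"leaderId": "a1", "duplicateId": "a9"},
    {"leaderId": "a3", "duplicateId": "a4"},
]

groups = group_duplicates_by_leader(sample_article_duplicates)
```

With a mapping like this, a client could show the hidden duplicates next to each item that was actually used in the answer. The same function would apply to chunk duplicates, assuming they expose the same two fields.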