Mastering Duplicate Data Management in Machine Learning for Optimal Model Performance

Learn how duplicate data affects machine learning models and uncover strategies to identify, analyze, and manage duplicate data effectively.

In today’s data-driven world, machine learning practitioners often face a critical yet underappreciated challenge: duplicate data management.

A massive amount of diverse data powers today’s ML models. Although gathering massive datasets has become easier than ever, the presence of duplicate records can considerably degrade data quality and model performance, and often leads to biased results. This article delves into how duplicate data affects machine learning models, including their accuracy and other performance metrics, and uncovers practical strategies to identify, analyze, and manage duplicate data effectively. You will find sections on managing duplicate data, best practices, current trends, and more. We hope you find this article thought-provoking!

Understanding Duplicate Data in the context of Machine Learning

Duplicate data can manifest across all data types consumed by machine learning models during training. Consequently, it can occur in text, images, audio, video, and tabular data. In most cases, these duplicates add little value to the models trained on the data. Broadly speaking, duplicate data can be categorized into exact duplicates, near duplicates, and similar or paraphrased versions. Let us explore each modality in detail.

Textual data

It’s important for us to understand the different types of duplicate text data we might encounter. Text duplicates commonly occur through web scraping of identical content across sites, social media reposts, copied articles, auto-generated content with slight variations, and translation pairs expressing the same meaning. These duplicates naturally emerge during large-scale data collection from multiple online sources.

Let’s take it one by one.

Exact duplicates: Consider these identical twins in the textual realm: character-for-character copies that match even in whitespace, capitalization, and punctuation. The main causes of exact text duplicates are web scraping, content syndication, and social media sharing, where the same content appears on several platforms. Marketing materials, news stories, and business documents are frequently repeated or reposted, producing exact replicas.

Near Duplicates: These are textual data with minor variations, such as spelling errors, or added or removed stop words and punctuation. An example could be “A quick brown fox jumped” versus “The quick brown fox jumps.” Nearly duplicate text often enters datasets through social media sharing, content aggregators, and news syndication. Subtle variations of the same material are typically produced by template-based content and minor modifications.

Paraphrases: Different wording conveying the same meaning. For instance, “The fox is quick and brown.” and “The brown fox is fast.” could be considered paraphrases.

Artifacts due to data augmentation: In NLP processes, data augmentation techniques like back translation and synonym replacement can sometimes inadvertently introduce near duplicate data points. This data proliferates across websites, blogs, and social media primarily via automated content creation, SEO-optimized spun text, chatbot interactions, and similar systems.

Image data

Datasets naturally contain duplicate images due to several interrelated processes. Often, the same visual material is presented in many file formats, such as JPEG, PNG, or WebP, and image processing produces changes through colour adjustments, cropping, and resizing. Nearly identical duplicates are produced by digital manipulations including mirroring, rotation, and filter application. Duplicates are further increased by the extensive use of stock pictures on websites and screenshots that capture related content. Furthermore, data augmentation techniques purposefully produce variations of the original photos.

Exact duplicates: These can be pixel-to-pixel identical images. That means they are of the same size, resolution, etc. They are often created by unintentional data copying or ingestion errors.

Near duplicates: These images exhibit slight variations, such as being cropped, having additional watermarks, differing in resolution, or being panorama shots. An example, as shown below, could include images from a burst shot sequence.

Similar and Augmented Images: Image augmentation plays a crucial role in computer vision by artificially expanding the size and diversity of training datasets. It involves creating altered images, such as ones that show the same object or scene from many viewpoints or angles. Augmented images help models spot patterns and similarities in images that look alike. Deep neural networks, in particular, need a great deal of training data to learn effectively and avoid overfitting, and gathering enough training samples can be tough. That’s why not every image that looks almost the same is a problem—some can actually be beneficial for training. It’s important to consider the specific learning task when selecting augmentation techniques so that the augmented data remains relevant and useful. If you’re interested in learning more about image augmentation, you might want to check out this article.

By applying transformations to an original image, you can create a large number of new training samples: source

Audio data

Identical Duplicates: Similar to text and image duplicates, these are formed by duplication errors during storage or retrieval.

Near duplicates: Audio with minor variations, such as different encoding formats (e.g., .aac, .mp3), added noise, or varying bitrates. For example, the same song might be uploaded in multiple formats or with slight edits, leading to near duplicates. These variations can inadvertently be included in datasets, affecting model training by introducing redundancy and bias.

Similar Audio: Audio recordings of the same event or sound but with different microphone placements or background noise. Multiple recordings of the same concert from different locations are a classic example.

Having explored the various types and potential sources of duplicate data across different modalities, let’s now delve into their impact.

Impact of duplicate data on model performance

Duplicate data often harms model performance, unless the duplicates are deliberately augmented samples meant to improve performance or increase minority-class representation. The extent and nature of the impact depend on several factors, including the proportion of duplicates, the type of duplicates (exact or near), the learning algorithm used, and the specific use case. It’s important to note that the effects can be negative and, at times, positive. Let’s look into the potential issues caused by duplicate data.

Negative Effects

Biased Evaluation: This primarily happens when exact and near duplicate data appears in both training and validation sets, leading to inflated model performance metrics. Hence, the model appears to perform better than it actually does on previously unseen data. Conversely, if the model performs poorly on a duplicated data point, the duplicates can deflate the overall metrics, making the model appear less effective than it actually is, since repeated poor performance on the same data point disproportionately affects the evaluation.

Overfitting: Essentially, the model becomes too specialized to the training set and fails to capture the underlying patterns needed to perform well on previously unseen data. This is especially true with exact and near duplicates in the training dataset, which can reinforce patterns that are not representative of the dataset’s general characteristics. This often leads to redundancy, where models memorize these specific instances rather than learning the underlying patterns.

Training Costs: Computations on duplicate data add overhead during pre-processing, post-processing, and training without adding real value to the model, which increases training time and resource requirements.

Biased feature importance: Duplicate data can artificially skew the importance of certain features in the dataset. If a feature is consistently associated with duplicated instances, the model might assign it higher importance than it deserves, leading to incorrect feature selection or interpretation. Note that this is not always a bad outcome; we will discuss this in the upcoming sections.

Positive effects

As mentioned earlier, it’s not all negative! Some examples of positive effects are given below.

Reinforcement of Important patterns: Repetition, or the occurrence of duplicates and near-duplicates, helps models internalize certain patterns, particularly large language models (LLMs) and code generation models, resulting in more dependable and effective code generation. For example, the model learns best practices and design concepts more thoroughly when high-quality code examples, such as OOP-based code structures and other design patterns (which are often near-duplicates), are repeated throughout the training dataset.

Another example: multiple instances of proper API error-handling patterns help the model consistently generate robust error-handling code. In effect, near-duplicates reinforce beneficial patterns in the model.

A deep dive into the effect of duplicate social media data can be found in the paper by Xianming Li et al. (2024). The paper proposes a generative-AI-based deduplication framework for detecting redundancy in social media data. It uses a self-supervised generative model to predict keywords, identifying and removing semantically duplicate texts, and adds Gaussian noise to the dataset during training to prevent the model from learning trivial features. Experiments on the TweetEval datasets demonstrate the method’s effectiveness, with results showing reduced training samples and time while improving model performance.

How to identify duplicate data

Duplicate data identification strategies depend on the types of duplicate data being addressed. The approach slightly differs between exact duplicates and near duplicates.

Text Data

Several techniques are commonly used to identify duplicate text data.

Exact String Matching: This approach identifies perfectly matching text data. First, we create a unique hash for each piece of text; then, when checking for duplicates, these hash keys are looked up. The advantage is that, with hash-based implementations, we can reduce the time complexity of deduplication to O(n). A simple Python implementation is shown below.

# Exact String Matching
def exact_match_deduplication(texts):
    return list(set(texts))

If exact duplicates are likely to be present in the training dataset, exact string matching algorithms can be used. Naive pairwise comparison tends to be O(n²) (or O(n log n) with sorting), whereas a hash-based implementation has O(n) complexity.

# Exact string matching using hashing
from typing import List, Dict
import hashlib

class ExactStringMatcher:
    def __init__(self):
        self.hash_table: Dict[str, str] = {}

    def get_hash_string(self, text: str) -> str:
        """Generate a SHA-256 hash for a string"""
        return hashlib.sha256(text.encode('utf-8')).hexdigest()

    def find_duplicates(self, texts: List[str]) -> Dict[str, List[str]]:
        """Find exact duplicates using a hash table with O(n) complexity"""
        duplicates: Dict[str, List[str]] = {}

        # Single pass through all strings - O(n)
        for text in texts:
            hash_value = self.get_hash_string(text)

            if hash_value in self.hash_table:
                if hash_value not in duplicates:
                    duplicates[hash_value] = [self.hash_table[hash_value]]
                duplicates[hash_value].append(text)
            else:
                self.hash_table[hash_value] = text

        return duplicates

# Usage
matcher = ExactStringMatcher()

# Sample text data
texts = [
    "Hello World",
    "Python Programming",
    "Hello World",
    "Data Structures",
    "Python Programming",
    "Algorithms"
]

# Find duplicates
duplicates = matcher.find_duplicates(texts)

Fuzzy String Matching: This is another popular technique for finding duplicate text data. Fuzzy matching uses intelligent algorithms to detect similar words and phrases, going beyond exact matches. By adjusting matching parameters, you can identify duplicates that share strong similarities even when they’re not completely identical. This makes fuzzy matching a powerful tool for finding related content across data collections.

There are several techniques within the fuzzy matching realm. A few of the popular ones are mentioned below.

Levenshtein Distance: A metric for measuring the difference between two strings by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other (a minimal implementation is sketched below).

Hamming Distance: The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ.

N-Gram based matching: Compares text documents by breaking them down into overlapping sequences of N contiguous words (N-grams).

BK-Tree (using Levenshtein distance and the triangle inequality): Organizes strings based on their edit distance (Levenshtein distance), allowing quick searches for strings within a specific similarity threshold by recursively pruning search branches.
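To make the Levenshtein metric concrete, here is a minimal dynamic-programming sketch; dedicated libraries such as python-Levenshtein are much faster in practice.

# Minimal Levenshtein distance (illustrative two-row dynamic programming version)
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + cost  # substitution
                               ))
        previous = current
    return previous[-1]

print(levenshtein("apple", "appel"))  # 2: a transposition counts as two single-character edits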

For more details, refer to this article. Below is a sample Python code snippet demonstrating library-based fuzzy matching.

# Fuzzy String Matching using Levenshtein Distance
from fuzzywuzzy import fuzz

def fuzzy_match_deduplication(texts, threshold=90):
    unique_text_list = []
    for text in texts:
        if not any(fuzz.ratio(text, t) > threshold for t in unique_text_list):
            unique_text_list.append(text)
    return unique_text_list

texts = [
    "apple",
    "appel",
    "banana",
    "bananna",
    "orange",
    "oragne"
]

threshold = 80
unique_texts = fuzzy_match_deduplication(texts, threshold)

MinHash: Before looking at MinHash, let us briefly examine Jaccard similarity. The Jaccard similarity index compares two finite sets and characterizes their similarity as the ratio of the size of their overlap to the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

For two disjoint sets, the numerator |A ∩ B| is zero, so the index is zero. If the sets are identical, A ∩ B = A ∪ B = A = B, and the Jaccard similarity is 1.

MinHash approximates Jaccard similarity using probabilistic techniques: it applies a set of hash functions to each element of the text sets and keeps the minimum hash value for each function. The probability that two sets share the same minimum hash value equals their Jaccard similarity. This gives us fast and accurate near-duplicate detection, especially for large document collections. For an in-depth treatment, please refer to this blog post.

A classic real-world use case is GPT-3 training: 60% of GPT-3’s weighted pre-training dataset originates from a filtered version of Common Crawl, comprising 410 billion byte-pair-encoded tokens. Fuzzy deduplication for this dataset was performed using Apache Spark’s MinHashLSH. The model achieved better performance with 45TB of deduplicated data versus 100TB of raw data, reducing training costs significantly.
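At a much smaller scale, below is a minimal sketch of MinHash-based near-duplicate detection using the open-source datasketch library (an assumed tooling choice for illustration, not the GPT-3 pipeline); the sample texts, number of permutations, and the 0.5 LSH threshold are illustrative.

# MinHash + LSH near-duplicate detection sketch with datasketch (assumed installed)
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from a text's lowercase word tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode('utf-8'))
    return m

texts = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "a quick brown fox jumped over a lazy dog",
    "doc3": "machine learning models need clean training data",
}

signatures = {key: minhash_signature(text) for key, text in texts.items()}

# Estimated Jaccard similarity between two signatures
print(signatures["doc1"].jaccard(signatures["doc2"]))

# Index signatures with LSH and query for likely near-duplicates of doc1
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, sig in signatures.items():
    lsh.insert(key, sig)
print(lsh.query(signatures["doc1"]))  # doc2 is likely returned alongside doc1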

Vector Space Theory: This approach identifies near-duplicate texts based on the assumption that similar texts lie close to each other in a multidimensional vector space. It’s a highly popular technique in natural language processing, where we transform words into dense vector representations in a high-dimensional space and semantic similarities are captured by the spatial relationships between these vectors. Distance metrics for semantic similarity include cosine similarity, Euclidean distance, Manhattan distance, and so on.
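To illustrate the vector-space idea, here is a small sketch using TF-IDF vectors and cosine similarity from scikit-learn; the example sentences are purely illustrative, and the similarity threshold would need tuning on a real dataset.

# Vector-space near-duplicate detection sketch with TF-IDF and cosine similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox jumped over a lazy dog",
    "Machine learning models need clean training data",
]

vectors = TfidfVectorizer().fit_transform(texts)
similarity_matrix = cosine_similarity(vectors)

# The near-duplicate pair (rows 0 and 1) scores well above the unrelated pair (rows 0 and 2)
print(similarity_matrix.round(2))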

Clustering: Clustering can group texts using features such as embedding vectors or TF-IDF vectors. Duplicate texts naturally tend to fall into the same clusters. Unsupervised algorithms such as K-Means and DBSCAN are commonly used to create the text clusters.

# Function for clustering-based deduplication
from sklearn.cluster import DBSCAN
import numpy as np

def cluster_deduplication(vectors, eps=0.5):
    """
    Perform clustering-based deduplication on vectors using DBSCAN

    Args:
        vectors: numpy array of vectors to deduplicate
        eps: epsilon parameter for DBSCAN (default: 0.5)

    Returns:
        List of indices of vectors that did not fall into any duplicate cluster
    """
    clustering = DBSCAN(eps=eps, min_samples=2).fit(vectors)
    # Points labelled -1 are noise, i.e. not grouped with any near-identical neighbour
    unique_indices = [i for i, label in enumerate(clustering.labels_)
                      if label == -1]
    return unique_indices

# Sample vectors (e.g., text embeddings); the last row is a near-copy of the first
sample_vectors = np.array([
    [0.10, 0.20, 0.30],
    [0.90, 0.80, 0.70],
    [0.10, 0.20, 0.31],
])

# Run deduplication
unique_idx = cluster_deduplication(sample_vectors, eps=0.5)

# Print results
print("Unique vectors indices:", unique_idx)

Active Learning for Deduplication: Apart from the classic techniques mentioned above, recent research uses approaches like active learning, explored in the paper by Liu, Xinyao, et al., “A Pre-trained Deep Active Learning Model for Data Deduplication” [Link]. They present PDDM-AL, an architecture that treats deduplication as a sequence classification problem and applies active learning to it. Prior to model training, data undergoes summarisation, blocking, serialization, and domain-knowledge injection. Uncertain examples are chosen for expert labelling and then fed back into the training dataset for additional active learning iterations, while the trained model generates duplicate/non-duplicate predictions on unlabeled data.

Design diagram of PDDM-AL

DagsHub makes active learning easier by combining tools like Data Engine, MLflow, and Label Studio. You can read more about the workflow in this article. By selecting and labelling data points based on model confidence scores, users can iteratively improve models, optimizing resource utilization while improving model accuracy. The workflow also facilitates data management, model training, and annotation procedures.

Image Data

Deduplication techniques used for image datasets aren’t that different from what we do for text: hashing, vector embeddings, clustering, and similar methods are widely used for images as well. Removing duplicate images helps improve dataset quality, reduces training time, and can enhance model performance by ensuring that the model learns from diverse examples. Let’s go into these techniques in a bit more detail below.

Hashing Technique

This technique generates a hash value based on the visual content of an image. Unlike cryptographic hashes, perceptual hashes are designed to be similar for visually similar images. For this reason, perceptual hashing (pHash) is the most popular approach.

A demonstration of how pHashes work is shown in the images below. The code snippet to generate an image hash is given below too.

Image by Author

The binary hash representation of the final pattern (shown in the image above) is given below:

1100100111001001101101100011110010111100100100110011010010010100

The Python code to convert any image to its pHash representation is given below. For more information, please refer to the paper from Ofcom, which provides an overview of the prevalent perceptual hashing techniques.

from PIL import Image
import imagehash
import matplotlib.pyplot as plt
import numpy as np

# Load the image
image_path = 'path_to_your_image.jpg'
original_image = Image.open(image_path)

# Step 1: Convert to grayscale
gray_image = original_image.convert('L')

# Step 2: Resize using LANCZOS
resized_image = gray_image.resize((32, 32), Image.Resampling.LANCZOS)

# Step 3: Compute pHash
phash = imagehash.phash(original_image)

# Step 4: Convert hash to binary array for visualization
# Convert hex hash to binary string and then to array
hash_binary = bin(int(str(phash), 16))[2:].zfill(64)
hash_array = np.array([int(bit) for bit in hash_binary]).reshape(8, 8)

# Display all steps
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].imshow(original_image)
axes[0].set_title('Original Image')
axes[0].axis('off')

axes[1].imshow(gray_image, cmap='gray')
axes[1].set_title('Grayscale Image')
axes[1].axis('off')

axes[2].imshow(resized_image, cmap='gray')
axes[2].set_title('Resized Image (32×32)')
axes[2].axis('off')

axes[3].imshow(hash_array, cmap='binary')
axes[3].set_title('pHash Binary Pattern (8×8)')
axes[3].axis('off')

plt.tight_layout()
plt.show()
print(f"Perceptual Hash: {hash_binary}")

Another group of hashing techniques decomposes the image into a set of waves using a Fourier transformation, whereby the darkness and brightness of pixels in the image are represented by the peaks and valleys of these waves as shown in the image below. For more information please refer to the source document.

A digital photograph (left) and the image function with its Fourier transform for the same photograph (right): source
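As a related frequency-domain technique, the imagehash library used above also provides a wavelet-based hash (whash). The sketch below compares two placeholder image files; the distance threshold is illustrative.

# Wavelet-hash comparison sketch with imagehash; file paths and threshold are illustrative
from PIL import Image
import imagehash

hash_a = imagehash.whash(Image.open('image_a.jpg'))
hash_b = imagehash.whash(Image.open('image_b.jpg'))

# The difference between two hashes is the Hamming distance in bits
distance = hash_a - hash_b
print(f"Hamming distance: {distance}")
if distance <= 5:  # small distances suggest near-duplicates
    print("Likely near-duplicate images")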

CNN (Deep Neural Network)-based Approach

Pre-trained CNNs (e.g., VGG, Inception, ResNet) can be used to extract high-level features from images. These features can then be compared using metrics like cosine similarity and Euclidean distance and, in turn, used for deduplication. Please refer to the article titled “Exploring Feature Extraction with CNNs” by Rodrigo Silva for more information on CNN-based feature extraction.

Visual representation of features extracted from an image using a CNN: images from Article & Pexels

Image Embeddings

CNNs and autoencoders can be used effectively to generate compact image embeddings. By comparing these embeddings using appropriate similarity metrics (cosine similarity, mean squared error, structural similarity index, etc.), you can identify similar or duplicate images.
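As a rough sketch of the embedding-based comparison, the snippet below extracts features with a pre-trained ResNet from torchvision (this assumes a recent torchvision version; the image paths are placeholders) and compares two images with cosine similarity.

# Embedding-based image comparison sketch using a pre-trained ResNet (assumed available)
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained ResNet and drop its classification head so it outputs embeddings
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model = nn.Sequential(*list(model.children())[:-1])  # remove the final fc layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    """Return a 512-dimensional embedding for an image file."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        features = model(preprocess(img).unsqueeze(0))
    return features.flatten()

# Compare two placeholder images with cosine similarity
emb_a, emb_b = embed("image_a.jpg"), embed("image_b.jpg")
similarity = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
print(f"Cosine similarity: {similarity:.3f}")  # values near 1.0 suggest near-duplicates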

Clustering Techniques (e.g., K-Means and DBSCAN):

These algorithms can be used to cluster similar images based on extracted features or embeddings. Duplicate images are typically found in the same cluster.

Each of the above-mentioned techniques has its strengths and weaknesses, and the choice of method often depends on the specific requirements of the dataset, ML model, and the available computational resources. In practice, a combination of these techniques may be used to achieve robust image deduplication.

Audio data

Similar to image and text data, audio file deduplication is an interesting area to work on. It can be challenging due to variations in file formats, bit rates, and metadata. Below are some common techniques.

Audio fingerprinting

Audio fingerprinting is the most popular method for deduplication.

Theoretical basis: Audio fingerprinting creates a compact digital summary of an audio signal based on its acoustic properties. It often uses perceptual hashing algorithms to generate unique fingerprints that remain stable despite minor modifications (similar to the image pHashing technique). Please refer to this article by Sergiu Ciumac for an in-depth reference.
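Below is a deliberately simplified, illustrative sketch of the fingerprinting idea, not a production algorithm: it hashes quantized chroma statistics computed with librosa. The file paths are placeholders, and real systems such as Chromaprint use far more robust spectral-peak fingerprints.

# Crude audio "fingerprint" sketch: hash quantized chroma statistics (illustrative only)
import hashlib
import librosa
import numpy as np

def crude_fingerprint(path: str) -> str:
    """Return a coarse hash summarizing the audio's spectral character."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    # Quantize per-bin means so tiny encoding differences map to the same bucket
    summary = np.round(chroma.mean(axis=1), 1)
    return hashlib.sha256(summary.tobytes()).hexdigest()

# Two files with identical fingerprints are strong duplicate candidates
print(crude_fingerprint("track_a.mp3") == crude_fingerprint("track_b.mp3"))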

Spectral Analysis

By comparing the spectral distribution, for example using Fourier transforms, files with similar audio content can be identified even if they differ in other aspects, such as bit rate or compression level.

The spectral analysis of an audio file is shown below (link to this audio file).

Image by Author

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the MP3 file
file_path = 'path/to/your/audiofile.mp3'
y, sr = librosa.load(file_path, sr=None)

# Compute the Short-Time Fourier Transform (STFT)
D = librosa.stft(y)

# Convert the amplitude to decibels
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Plot the spectrogram
plt.figure(figsize=(10, 6))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log', cmap='viridis')
plt.title('Spectrogram')
plt.xlabel('Time')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

Spectral analysis can be a powerful tool for duplicate audio detection because it provides a detailed representation of the frequency content of an audio file over time. Once the spectral representation is available, a workflow to detect duplicate audio files can be built as outlined below.

1. Load the audio files and compute their spectrograms.
2. Extract relevant features from the spectrograms, for example Mel-frequency cepstral coefficients (MFCCs) or chroma features.
3. Use audio-specific similarity metrics or a trained model to compare the extracted features.
4. Decide whether to deduplicate files or keep them based on similarity thresholds (a minimal sketch of this workflow follows).
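Here is a minimal sketch of that workflow, assuming two local audio files and using librosa's MFCC features with cosine similarity; the 0.95 threshold is illustrative.

# MFCC-based duplicate-audio comparison sketch; file paths and threshold are illustrative
import librosa
import numpy as np

def mfcc_vector(path: str) -> np.ndarray:
    """Load an audio file and return its mean MFCC vector as a compact feature."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vec_a = mfcc_vector("recording_a.wav")
vec_b = mfcc_vector("recording_b.wav")
similarity = cosine_similarity(vec_a, vec_b)
print(f"MFCC cosine similarity: {similarity:.3f}")
if similarity > 0.95:  # illustrative threshold
    print("Likely duplicate or near-duplicate audio")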

Best Practices for Data Deduplication in ML Model Training

Before implementing any deduplication technique, you need to understand the nature and traits of your data sources and types. Formats, standards, and quality levels can vary across data providers, and data from distinct sources may have distinct characteristics, values, and relationships.

Though there is no set of golden rules, some general best practices for developing deduplication pipelines are listed below.

Data Validation: This is important for deduplication because it establishes a baseline of data quality before the deduplication process begins. Ensure that incoming data conforms to the expected schema, and include checks for completeness, missing values, and anomalies. Tools like TensorFlow Data Validation can help automate the process.

Documentation and versioning: Maintain clear documentation of data sources, transformations, and processing steps. Use version control for datasets and preprocessing scripts, and manage metadata collection and storage efficiently. A clear audit trail of how duplicate records are identified and handled throughout the data pipeline makes it easier to improve the process in the future.

Streaming data: Use windowed deduplication techniques to identify duplicates within a specific time frame; tools like Apache Kafka and Apache Flink can be configured for this purpose. Event timestamps help accurately identify and remove duplicates in streaming data and ensure that late-arriving data is handled correctly (a minimal windowed-deduplication sketch follows this list).

Domain-specific workflows: Leverage domain knowledge to define what constitutes a duplicate, and develop custom algorithms tailored to the datasets and the ML models to be trained. This might involve separate machine learning models used purely for deduplication. Use feedback loops and involve subject-matter experts to refine and improve the deduplication pipelines. A classic example comes from healthcare, where patient records sharing the same name, ID, and date of birth may still be duplicates even if the addresses differ.
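Below is a minimal, standard-library sketch of the windowed-deduplication idea for streaming records; the record keys, timestamps, and 60-second window are illustrative, and a production system would typically rely on Kafka or Flink state instead.

# Time-windowed deduplication sketch for streaming events (illustrative only)
from collections import OrderedDict

class WindowedDeduplicator:
    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.seen: "OrderedDict[str, float]" = OrderedDict()  # key -> last event timestamp

    def is_duplicate(self, key: str, event_time: float) -> bool:
        """Return True if `key` was already seen within the time window."""
        # Evict entries whose last occurrence has fallen outside the window
        while self.seen and event_time - next(iter(self.seen.values())) > self.window_seconds:
            self.seen.popitem(last=False)
        duplicate = key in self.seen
        self.seen[key] = event_time
        self.seen.move_to_end(key)
        return duplicate

# Usage with illustrative (key, timestamp) events
dedup = WindowedDeduplicator(window_seconds=60)
events = [("user42:click", 0.0), ("user42:click", 10.0), ("user42:click", 120.0)]
for key, ts in events:
    print(key, ts, "duplicate" if dedup.is_duplicate(key, ts) else "unique")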

Future Directions and Emerging Technologies

Data deduplication is becoming more crucial as AI training datasets expand rapidly. Contemporary deduplication methods emphasize both efficiency and preserving data quality while eliminating redundancies. Some recent and upcoming trends in deduplication are discussed below.

Semantic Deduplication: Semantic deduplication using vector embeddings is becoming more prevalent in text and image deduplication workflows. It preserves diverse examples while removing redundant variations, as sketched below.
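A minimal sketch of semantic deduplication, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model are available; the 0.9 similarity threshold is illustrative.

# Semantic deduplication sketch using sentence embeddings (assumed packages)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The fox is quick and brown.",
    "The brown fox is fast.",
    "Stock markets closed higher today.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# Keep a text only if it is not too similar to any already-kept text
kept, kept_embeddings = [], []
for text, emb in zip(texts, embeddings):
    if not any(cosine_similarity([emb], [k])[0][0] > 0.9 for k in kept_embeddings):
        kept.append(text)
        kept_embeddings.append(emb)

print(kept)  # semantically redundant paraphrases are dropped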

Real-Time Deduplication: Emerging approaches process and deduplicate data on the fly, especially using the hashing and indexing techniques discussed earlier; this is particularly applicable to streaming data sources. They often rely on distributed computing and optimized data structures to ensure low-latency processing.

Multimodality: Techniques like cross-modal similarity detection and the integration of diverse data formats are likely to become more prevalent. The shift towards multimodal ML models enables a more comprehensive and nuanced understanding by leveraging information from various sources such as audio, images, and text.

Using Machine Learning & Reinforcement Learning models: Self-supervised and self-improving deduplication models, which often adaptively select their filtering thresholds.

Blockchain for Deduplication: Utilizing blockchain technology to maintain a decentralized and immutable record of data chunks, ensuring that duplicates are identified and managed across distributed systems.

Federated Learning: Deduplication can be applied effectively in a federated learning setup where data is distributed across multiple nodes. This ensures that deduplication happens without centralizing data.

Quality-aware deduplication: An emerging technique that selectively preserves high-quality training samples, often incorporating automatic bias detection and mitigation.

Conclusion

As the title suggests, mastering duplicate data management remains a key criterion for successful machine learning model development.

Throughout this article, we have explored the deduplication techniques used for each data modality. As you may have observed, from text and images to audio files, each modality requires specialized deduplication techniques that balance efficiency with data quality.

Modern approaches provide robust frameworks for tackling this challenge across various domains and modalities. We have looked specifically at techniques used for text, image, and audio datasets. You may notice that some of the algorithms mentioned here overlap across modalities, especially between text and image data, because both involve pattern recognition and feature comparison and rely on the same similarity metrics, such as cosine similarity and Euclidean distance.

In an era where datasets grow at an exponential rate, deduplication requires adaptive, intelligent solutions that preserve relevant and valuable training samples even while processing data in real time. We hope this article serves as a practical guide for effective data deduplication, helping you optimize ML pipelines, reduce training costs, and ultimately build more reliable and performant models!
