Visualizing and Analyzing Unstructured Datasets with RepoViz

September 12, 2024
1:00 pm

Learn about RepoViz, a tool for exploring unstructured data with EDA tools such as D-Tale, SweetViz, and YData Profiling. It enables faster insights, quality checks, and visualizations for image, audio, and text datasets.


Introduction

Understanding your data is crucial for effective model development. While exploring tabular data is often straightforward, visualizing unstructured data like images, audio, and documents can be challenging.

We built RepoViz to help explore unstructured data within your DagsHub repositories, making it easier to gain insights and drive your projects forward.

The Challenge of Unstructured Data in Exploratory Data Analysis

Exploratory Data Analysis (EDA) is crucial in any data science project. It helps uncover patterns, detect anomalies, and guide further analysis. For structured datasets, numerous tools exist to facilitate this process. However, when it comes to unstructured data like images, audio, or text documents, EDA becomes significantly more challenging.

Unlike tabular data, unstructured data doesn’t neatly fit into rows and columns. This makes it difficult to apply traditional statistical methods or create standard visualizations. How do you summarize a collection of images? How do you detect outliers in audio files? These challenges often lead to time-consuming manual inspection or the development of custom scripts, slowing down the entire machine learning pipeline.

Introducing RepoViz

RepoViz is designed to bridge this gap, offering a comprehensive suite of tools for visualizing and analyzing unstructured datasets.

With RepoViz, you can:

Easily connect to your DagsHub repositories
Select and load unstructured datasets
Generate insightful visualizations and reports using powerful EDA tools

You can visualize and analyze your data with RepoViz in three simple steps:

Upload your data to DagsHub, or connect an existing bucket.
Set up RepoViz using the guide below.
Select your dataset in the UI and enjoy the visualizations.

RepoViz includes integrations with D-Tale, SweetViz, and YData Profiling. By using multiple EDA tools, you can leverage their unique capabilities to get a more thorough understanding of your data. For instance:

D-Tale provides interactive exploration and real-time data manipulation.
SweetViz offers quick, automated comparisons between datasets.
YData delivers detailed statistical summaries and data quality assessments.

Setting up RepoViz

Let’s walk through how to install RepoViz. For the most up-to-date instructions, refer to our GitHub page.

First, download the code for RepoViz from GitHub into your working directory.

git clone https://github.com/zndr27/repo_viz.git

Next, create a virtual environment and install the dependencies from the pyproject.toml file using your favorite package management tools.

For example, you can use pdm and pyenv as follows:

cd path/to/repo-viz
pyenv install 3.12.4
pdm use 3.12.4
pdm install

Finally, use pip to install the DagsHub client in the virtual environment. This is necessary due to a dependency version conflict with YData Profiling.

cd path/to/repo-viz
eval $(pdm venv activate)
python -m pip install --upgrade dagshub

To run RepoViz, you’ll first need to start the D-Tale server in your terminal:

python path/to/repo-viz/dtale_app.py

Then, start the RepoViz server in another terminal session:

python path/to/repo-viz/app.py

Finally, open your browser and navigate to http://localhost:8051/. If you need to use a different port, you can specify it as follows.

python path/to/repo-viz/app.py --port 8051

How It Works

When the app starts, you’ll be prompted to provide an access token.

Our application leverages the DagsHub API to access data directly from your repositories. Once your repository is specified, you can select the data sources you wish to visualize.

Search for your repository, then select a data source from the dropdown menu.

Currently, RepoViz uses metadata from unstructured data to create a dataframe that the three EDA tools can analyze. We highlight each of them below.
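To make the metadata-to-dataframe idea concrete, here is a minimal sketch of how file-level metadata could be gathered into rows. This is not RepoViz’s actual implementation: the function name collect_file_metadata and the chosen columns are hypothetical, and it reads a local folder rather than a DagsHub data source.

```python
import mimetypes
from pathlib import Path


def collect_file_metadata(root):
    """Return one metadata row (a dict) per file found under `root`.

    Hypothetical helper for illustration; the column names are assumptions.
    """
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        mime_type, _ = mimetypes.guess_type(path.name)
        rows.append({
            "path": path.relative_to(root).as_posix(),
            "extension": path.suffix.lower(),
            "mime_type": mime_type,  # None when the type cannot be guessed
            "size_bytes": path.stat().st_size,
        })
    return rows
```

A list of dicts like this can be passed to pandas.DataFrame(rows) to obtain a dataframe with one row per file, which is the kind of input the three tools below operate on.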

D-Tale

Overview: D-Tale is an interactive, web-based EDA tool that allows users to explore and manipulate dataframes and create custom visualizations.

Pros:

Interactive Exploration: D-Tale provides a highly interactive environment where users can sort, filter, and visualize data in real-time.
Data Manipulation: D-Tale allows for data editing, adding new columns, and other manipulations.
Custom Visualizations: Offers a wide range of interactive visualizations, such as scatter plots, heatmaps, and correlation matrices.
Export Findings: Custom visualizations and filtered views can be exported in several formats (PNG, CSV, Excel, JSON, Parquet, Feather, SQL, and code).

Cons:

Learning Curve: The interface contains many detailed features that take time to learn how to use effectively.
Performance: Because it’s highly interactive, D-Tale can sometimes operate slowly for larger datasets.
Limited Automation: Unlike SweetViz and YData, which automatically generate reports, D-Tale requires user interaction in order to obtain insights.

For interactive, real-time data exploration and iterative analysis, D-Tale is the best choice due to its dynamic GUI and real-time capabilities.

D-Tale allows you to explore, modify, and highlight the dataframe.

In the “Charts” window, you can create custom plots based on Plotly. They may be exported as images, or as generated scripts that can be run to replicate them.

SweetViz

Overview: SweetViz automates the EDA process by generating an HTML report that compares datasets, or analyzes them individually, with respect to an optional target variable.

Pros:

Automated Reports: Creates a comprehensive report that includes essential statistics, missing value analysis, and visualizations for distributions and correlations.
Comparison Feature: SweetViz was designed for comparing feature distributions, such as in training/test splits or different sub-populations.
Target Variables: SweetViz allows you to track how a target variable (boolean or numerical) varies with respect to each feature distribution.

Cons:

Static Reports: Users are unable to interactively modify the visualizations.
Low Customization: SweetViz uses a fixed plot type for each data type (e.g. histograms for numerical, bar plots for categorical).
Single Dataset: Without specifying a target variable or comparison dataset, SweetViz provides fewer details compared to YData and D-Tale.

SweetViz is the optimal tool for quickly visualizing target variables and comparing datasets.

Here we see an example of a medical imaging dataset divided into two cohorts according to scanner type, and a target variable indicating a specific disease phenotype.
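The kind of comparison described above, two cohorts summarized against a target variable, can be sketched in plain Python. This is only an illustration of the idea and not how SweetViz computes its report; the function name summarize_by_cohort and the keys scanner, size_bytes, and has_phenotype are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_cohort(rows, cohort_key, feature_key, target_key):
    """Group metadata rows by cohort and summarize one feature and a boolean target."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cohort_key]].append(row)

    summary = {}
    for cohort, members in groups.items():
        summary[cohort] = {
            "count": len(members),
            # Average value of the numeric feature within the cohort
            "feature_mean": mean(m[feature_key] for m in members),
            # Fraction of rows in the cohort where the target is true
            "target_rate": mean(1 if m[target_key] else 0 for m in members),
        }
    return summary


# Hypothetical metadata rows for scans from two scanner types
scans = [
    {"scanner": "A", "size_bytes": 512, "has_phenotype": True},
    {"scanner": "A", "size_bytes": 256, "has_phenotype": False},
    {"scanner": "B", "size_bytes": 1024, "has_phenotype": True},
    {"scanner": "B", "size_bytes": 1024, "has_phenotype": True},
]
summary = summarize_by_cohort(scans, "scanner", "size_bytes", "has_phenotype")
```

A large gap between cohorts in either the feature mean or the target rate is the sort of signal a side-by-side report is meant to surface.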

YData Profiling

Overview: YData automates the EDA process by producing detailed HTML reports that describe the dataset’s structure, correlations, and potential data quality issues.

Pros:

Comprehensive Reports: Automatically produces detailed reports that cover a wide range of statistics, correlations, and visualizations.
Data Quality Checks: Includes checks for missing data, duplicates, unbalanced distributions, and other common issues that are important for data cleaning.
Customization: Includes options for filtering data from the report, plot types, custom metrics, and themes.

Cons:

Static Reports: Users are unable to interactively modify the visualizations.
Comparing Datasets: While YData provides the option to compare results from two datasets side-by-side, it is not as streamlined as SweetViz’s presentation.

YData is best suited for users who need a comprehensive, in-depth report on their datasets, particularly when assessing data quality and preparing data for further analysis.

The “Overview” tab allows you to quickly gather insights about your datasets.

YData generates summary statistics, quality checks, and visualizations for each feature.
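As a rough illustration of the quality checks mentioned above (missing data, duplicates, unbalanced distributions), here is a minimal sketch over a list of metadata rows. It is not YData Profiling’s implementation; the function name basic_quality_checks and the example keys are hypothetical, and it assumes every value is hashable.

```python
from collections import Counter


def basic_quality_checks(rows, label_key):
    """Count missing values, duplicate rows, and the majority label share."""
    columns = sorted({key for row in rows for key in row})

    # Missing data: a value counts as missing when it is absent or None
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }

    # Duplicates: rows whose values match an earlier row in every column
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Imbalance: share of the most common label among labeled rows
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    majority_share = max(labels.values()) / sum(labels.values()) if labels else None

    return {
        "missing": missing,
        "duplicate_rows": duplicates,
        "majority_label_share": majority_share,
    }


# Hypothetical metadata rows for an audio dataset
clips = [
    {"path": "a.wav", "duration_s": 1.0, "label": "speech"},
    {"path": "b.wav", "duration_s": None, "label": "speech"},
    {"path": "b.wav", "duration_s": None, "label": "speech"},
    {"path": "c.wav", "duration_s": 2.5, "label": "music"},
]
report = basic_quality_checks(clips, "label")
```

A full profiling report covers far more than this, but these three counts are often enough to decide whether a dataset needs cleaning before modeling.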

Comparison of EDA Tools

D-Tale, SweetViz, and YData Profiling are all powerful tools for EDA, but they each have unique strengths and are suited to different scenarios. Here’s a comparison of the three:

| | D-Tale | SweetViz | YData Profiling |
| --- | --- | --- | --- |
| Complexity | Highly complex GUI that allows you to modify and filter data in real time. | Simple reports with minimal customization options. | Moderately complex reports with more customization options. |
| Customization | Custom filters and visualizations in real-time. | Fixed visualizations. Custom styling. | Custom styling, plot types, and metrics. |
| Interactivity | Interactive, real-time updates. | Static report. | Static report. |
| Output Formats | Export filtered data, visualizations, and code (HTML, PNG, CSV, JSON, TXT). | HTML report. | HTML report. |
| Performance | Slower performance for larger datasets. | Report generation is slower for larger datasets. | Performance options to speed up report generation for larger datasets. |
| Comparing Datasets | Interactive filtering/comparisons. | Ability to compare datasets and select target variables. | Ability to compare datasets. |
| Data Quality Assessment | Interactive quality checks. | Basic data quality checks. | Extensive quality checks and insights. |

With these differences in mind, here are our recommendations for each tool’s best use case:

Choose YData for your initial pass of the data in order to get detailed insights and quality assessments. Compared to the other tools, YData is the most robust option for generating a comprehensive report without the need for additional inputs.

Use SweetViz to quickly compare data subsets, particularly with respect to a target variable. While the other tools have the capability for side-by-side comparisons, SweetViz has the cleanest implementation as it was designed for the purpose.

For interactive data exploration and iterative analysis, use D-Tale. D-Tale is likely where you will spend most of your time, as it is powerful at gathering specific insights, but requires more effort compared to the other tools.

Future Additions

In future updates to RepoViz, we plan to introduce more options for report generation and new ways to extract features from unstructured data:

Customizable reports for SweetViz and YData

Specify data subsets for comparison
Specify target variables (SweetViz only)
Theme selection

Automatic feature extraction

Embedding models (e.g. image, audio, text, etc.)
Features generated from Label Studio annotations
Clustering and anomaly detection

New visualizations

Clustering visualization (e.g. t-SNE, UMAP)
Sample original data (e.g. images) from filtered views

We’ve made RepoViz open-source to better meet the community’s needs. If there’s a feature you’d love to see, feel free to create an issue or submit a pull request on our GitHub page. All contributions are welcome! 😊

Conclusion

DagsHub is a platform for versioning, sharing, and collaborating on machine learning projects. While DagsHub excels at managing data and workflows, we saw an opportunity to boost its capabilities for unstructured data analysis with RepoViz.

As the complexity of unstructured data continues to grow, tools like RepoViz become increasingly crucial in order to:

Provide quick insights
Identify quality issues early
Make informed decisions on data pre-processing
Accelerate model development

RepoViz automates much of the EDA process, saving you time and providing consistent visualizations across teams. You can also export custom visuals and filtered data easily.

Try RepoViz out and see how it can elevate your exploratory data analysis for unstructured datasets. We’re eager to hear your feedback to make it even more powerful and tailored to your specific needs!
