# Data Deduplicator (`parsebird/dataset-deduplicator`) Actor

Merge and deduplicate Apify datasets by any field combination. Remove duplicate rows while keeping the first or last occurrence. Supports case-insensitive matching and whitespace trimming.

- **URL**: https://apify.com/parsebird/dataset-deduplicator.md
- **Developed by:** [ParseBird](https://apify.com/parsebird) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 4 total users, 2 monthly users, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

from $1.49 / 1,000 items processed

This Actor is paid per event: you pay a fixed price for specific events rather than for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Data Deduplicator

Merge and deduplicate Apify datasets by any field combination. Remove duplicate rows automatically with case-insensitive matching and whitespace trimming built in.

<table><tr>
<td style="border-left:4px solid #2563EB;padding:12px 16px;font-weight:600">Combine multiple Apify datasets and remove duplicates by URL, email, name + company, or any field combination. Case-insensitive matching and whitespace trimming built in.</td>
</tr></table>

<br>

<table>
<tr>
<td colspan="3" style="padding:10px 14px;background:#2563EB;border:none;border-radius:4px 4px 0 0">
<span style="color:#FFFFFF;font-size:14px;font-weight:700;letter-spacing:0.5px">ParseBird Infra Suite</span>
<span style="color:#BFDBFE;font-size:13px">&nbsp;&nbsp;&bull;&nbsp;&nbsp;Utility tools for data pipelines</span>
</td>
</tr>
<tr>
<td style="padding:10px 14px;border:1px solid #E7E5E4;border-radius:0 0 0 4px;border-right:none;border-top:none;vertical-align:top;width:33%">
&#128279; &nbsp;<a href="https://apify.com/parsebird/http-request-actor" style="color:#1C1917;text-decoration:none;font-weight:700;font-size:13px">HTTP Request</a><br>
<span style="color:#78716C;font-size:11px">Send API calls from the cloud</span>
</td>
<td style="padding:10px 14px;border:1px solid #E7E5E4;border-right:none;border-top:none;vertical-align:top;width:33%;background:#DBEAFE">
&#128218; &nbsp;<a href="https://apify.com/parsebird/dataset-deduplicator" style="color:#2563EB;text-decoration:none;font-weight:700;font-size:13px">Data Deduplicator</a><br>
<span style="color:#2563EB;font-size:11px;font-weight:600">&#10148; You are here</span>
</td>
<td style="padding:10px 14px;border:1px solid #E7E5E4;border-radius:0 0 4px 0;border-top:none;vertical-align:top;width:33%">
&#128481; &nbsp;<a href="https://apify.com/parsebird/data-cleaner" style="color:#1C1917;text-decoration:none;font-weight:700;font-size:13px">Data Cleaner</a><br>
<span style="color:#78716C;font-size:11px">Clean nulls, normalize case, format phones & emails</span>
</td>
</tr>
</table>

##### Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this Actor.

````

parsebird/dataset-deduplicator on Apify. Call: ApifyClient("TOKEN").actor("parsebird/dataset-deduplicator").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for deduplicated results. Key inputs: datasetIds (array of strings — Apify dataset IDs to merge), jsonData (array of objects — direct JSON input, alternative to datasetIds), fields (array of strings, required — field names for dedup key). Matching is case-insensitive with whitespace trimming. First occurrence is kept. Full Actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~dataset-deduplicator (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

````

### What does Data Deduplicator do?

This Actor merges one or more Apify datasets and removes duplicate rows based on fields you specify. It's the fastest way to clean up scraped data before analysis or export.

- **Single-field dedup** — deduplicate by `url`, `email`, `phone`, or any single field
- **Composite key dedup** — combine multiple fields like `firstName` + `lastName` + `company` to identify unique records
- **Smart matching** — case-insensitive comparison with automatic whitespace trimming
- **Multi-dataset merge** — combine items from multiple dataset IDs before deduplication
- **Direct JSON input** — pass data directly as a JSON array instead of referencing datasets
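
To make the matching rules above concrete, here is a minimal local sketch of first-occurrence deduplication with trimmed, case-insensitive keys. It is not the Actor's source code: the helper names `dedup_key` and `deduplicate` are hypothetical, and treating a missing field as an empty string is an assumption this README does not confirm.

```python
from typing import Any, Iterable


def dedup_key(item: dict[str, Any], fields: list[str]) -> tuple[str, ...]:
    """Build a normalized key: stringify, trim whitespace, lowercase.

    Assumption: a missing field is treated as an empty string.
    """
    return tuple(str(item.get(field, "")).strip().lower() for field in fields)


def deduplicate(items: Iterable[dict[str, Any]], fields: list[str]) -> list[dict[str, Any]]:
    """Keep the first item seen for each normalized key, preserving order."""
    seen: set[tuple[str, ...]] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)  # the item itself is left unmodified
    return unique


rows = [
    {"name": "John Doe", "email": "john@example.com"},
    {"name": "Jane Smith", "email": "jane@example.com"},
    {"name": "John D.", "email": "  JOHN@Example.com "},
]
result = deduplicate(rows, ["email"])
print([row["name"] for row in result])  # ['John Doe', 'Jane Smith']
```

The third row differs from the first only in letter case and surrounding whitespace, so it is dropped as a duplicate and the first occurrence is kept.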

### How to use it (5 steps)

1. **Run your scraper(s)** — collect data into one or more Apify datasets
2. **Copy the dataset ID(s)** — find them in the Apify Console under your run's Storage tab
3. **Choose your dedup fields** — pick the field(s) that uniquely identify each record
4. **Run this Actor** — pass the dataset IDs and field names as input
5. **Get clean data** — deduplicated items appear in the output dataset

### Input parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `datasetIds` | string[] | No* | — | Apify dataset IDs to merge and deduplicate |
| `jsonData` | array | No* | — | Direct JSON array of objects to deduplicate |
| `fields` | string[] | **Yes** | — | Field names for the dedup key |

*Provide either `datasetIds` or `jsonData` (or both).
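
The input rules in the table can be checked locally before a run is started. The sketch below uses a hypothetical helper, `build_input`, which is not part of the Actor or the Apify client; it only assembles the dictionary you would pass as the run input and rejects combinations the table marks as invalid.

```python
from typing import Any, Optional, Sequence


def build_input(
    fields: Sequence[str],
    dataset_ids: Optional[Sequence[str]] = None,
    json_data: Optional[Sequence[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble an input object that follows the parameter table above."""
    if not fields:
        raise ValueError("'fields' is required and must not be empty")
    if not dataset_ids and not json_data:
        raise ValueError("provide 'datasetIds', 'jsonData', or both")

    run_input: dict[str, Any] = {"fields": list(fields)}
    if dataset_ids:
        run_input["datasetIds"] = list(dataset_ids)
    if json_data:
        run_input["jsonData"] = list(json_data)
    return run_input


print(build_input(["email"], dataset_ids=["DATASET_ID_1", "DATASET_ID_2"]))
```

Failing early on the client side avoids starting a run that cannot produce any output.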

### Composite key examples

| Use case | Fields | Effect |
|----------|--------|--------|
| Unique URLs | `["url"]` | One row per URL |
| Unique emails | `["email"]` | One row per email address |
| Unique people | `["firstName", "lastName", "company"]` | One row per person at each company |
| Unique products | `["sku", "marketplace"]` | One row per SKU per marketplace |

### Output example

Deduplicated items retain their original structure — no fields are added or removed:

```json
[
    {"name": "John Doe", "email": "john@example.com", "company": "Acme"},
    {"name": "Jane Smith", "email": "jane@example.com", "company": "Beta"},
    {"name": "Bob Wilson", "email": "bob@example.com", "company": "Gamma"}
]
```

A `stats` key is stored in the key-value store:

```json
{
    "totalLoaded": 5000,
    "uniqueKept": 3200,
    "duplicatesRemoved": 1800,
    "datasetsProcessed": 3
}
```
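
A possible way to read these statistics after a run is sketched below. It assumes the `stats` record lives in the run's default key-value store and that the client's `get_record` call returns a dictionary with a `value` entry, as in the Apify Python client; verify both against your client version. The check `totalLoaded == uniqueKept + duplicatesRemoved` is inferred from the example numbers above, not from a documented guarantee, and the helper names are hypothetical.

```python
from typing import Any, Optional


def read_stats(client: Any, run: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Fetch the `stats` record from the run's default key-value store.

    `client` is expected to be an `ApifyClient` instance (assumption).
    """
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record("stats")
    return record["value"] if record else None


def stats_are_consistent(stats: dict[str, int]) -> bool:
    """Check that kept and removed items add up to the items loaded."""
    return stats["totalLoaded"] == stats["uniqueKept"] + stats["duplicatesRemoved"]


sample_stats = {
    "totalLoaded": 5000,
    "uniqueKept": 3200,
    "duplicatesRemoved": 1800,
    "datasetsProcessed": 3,
}
print(stats_are_consistent(sample_stats))  # True
```

With a real client, `read_stats(client, run)` would be called with the `run` dictionary returned by `client.actor(...).call(...)`.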

### How to use via API

**Python**

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("parsebird/dataset-deduplicator").call(run_input={
    "datasetIds": ["DATASET_ID_1", "DATASET_ID_2"],
    "fields": ["email"],
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Unique items: {len(items)}")
```

**Node.js**

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('parsebird/dataset-deduplicator').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['firstName', 'lastName', 'company'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Unique items: ${items.length}`);
```

**cURL**

```bash
curl -X POST "https://api.apify.com/v2/acts/parsebird~dataset-deduplicator/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetIds": ["DATASET_ID_1"],
    "fields": ["url"]
  }'
```
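
The cURL call above starts a run and returns run metadata. The OpenAPI specification for this Actor also lists a `run-sync-get-dataset-items` endpoint that waits for the run and returns the dataset items directly. The sketch below builds that request with Python's standard library only; the helper names and the 300-second timeout are illustrative choices, and the response is assumed to be a JSON array of items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
SYNC_PATH = "/acts/parsebird~dataset-deduplicator/run-sync-get-dataset-items"


def build_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request with the token in the query string and a JSON body."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{SYNC_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and parse the response (assumed to be a JSON array)."""
    request = build_sync_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_sync_request(
    "YOUR_API_TOKEN",
    {"datasetIds": ["DATASET_ID_1"], "fields": ["url"]},
)
print(request.get_method(), request.full_url)
```

Only `build_sync_request` runs in this example; call `run_sync` with a real token and dataset ID to execute the Actor.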

### Tips and best practices

- **Start with a single field** — `url` or `email` usually covers most use cases
- **Use composite keys carefully** — the more fields, the stricter the matching (fewer duplicates found)
- Matching is always case-insensitive with whitespace trimming — no configuration needed
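
The effect of adding fields to the key can be previewed locally. The sketch below counts distinct normalized keys for the sample records from the input example; `count_unique` is a hypothetical helper that mirrors the documented matching rules (trimmed, case-insensitive) and is not the Actor's implementation.

```python
from typing import Any


def count_unique(items: list[dict[str, Any]], fields: list[str]) -> int:
    """Count distinct keys after trimming whitespace and lowercasing each field."""
    keys = {
        tuple(str(item.get(field, "")).strip().lower() for field in fields)
        for item in items
    }
    return len(keys)


records = [
    {"name": "John Doe", "email": "john@example.com", "company": "Acme"},
    {"name": "Jane Smith", "email": "jane@example.com", "company": "Beta"},
    {"name": "John D.", "email": "john@example.com", "company": "Acme Inc"},
]

print(count_unique(records, ["email"]))             # 2
print(count_unique(records, ["email", "company"]))  # 3
```

With `email` alone, the two John records collapse into one. Adding `company` makes "Acme" and "Acme Inc" distinct, so no duplicates are found.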

### Pricing

This Actor uses a pay-per-event pricing model.

| Event | Price per event | Price per 1,000 |
|-------|----------------|-----------------|
| `items-processed` | $0.00149 | **$1.49** |

Charged per 1,000 items loaded (not per unique item). Platform compute costs are additional.
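
As a rough budgeting aid, the event charge can be estimated from the number of items loaded and the per-event price in the table. This is a sketch only: `estimate_cost` is a hypothetical helper, rounding to cents is for display, and the result excludes any other costs.

```python
from decimal import Decimal

PRICE_PER_ITEM = Decimal("0.00149")  # `items-processed` price from the table


def estimate_cost(items_loaded: int) -> Decimal:
    """Estimate the event charge in USD for a given number of loaded items."""
    if items_loaded < 0:
        raise ValueError("items_loaded must be non-negative")
    return (PRICE_PER_ITEM * items_loaded).quantize(Decimal("0.01"))


# 5,000 items loaded, of which only 3,200 are unique, are still billed as 5,000.
print(estimate_cost(5000))  # 7.45
```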

# Actor input Schema

## `datasetIds` (type: `array`):

One or more Apify dataset IDs to merge and deduplicate. Items from all datasets are combined before deduplication.

## `jsonData` (type: `array`):

Direct JSON array of objects to deduplicate. Use this instead of datasetIds if you want to pass data directly.

## `fields` (type: `array`):

Field names to use as the deduplication key. Rows with identical values across all specified fields are considered duplicates. Use a single field (e.g. 'url') or a combination (e.g. 'firstName' + 'lastName' + 'company'). Matching is case-insensitive and whitespace is trimmed automatically.

## Actor input object example

```json
{
  "jsonData": [
    {
      "name": "John Doe",
      "email": "john@example.com",
      "company": "Acme"
    },
    {
      "name": "Jane Smith",
      "email": "jane@example.com",
      "company": "Beta"
    },
    {
      "name": "John D.",
      "email": "john@example.com",
      "company": "Acme Inc"
    }
  ],
  "fields": [
    "email"
  ]
}
```

# Actor output Schema

## `dataset` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "jsonData": [
        {
            "name": "John Doe",
            "email": "john@example.com",
            "company": "Acme"
        },
        {
            "name": "Jane Smith",
            "email": "jane@example.com",
            "company": "Beta"
        },
        {
            "name": "John D.",
            "email": "john@example.com",
            "company": "Acme Inc"
        }
    ],
    "fields": [
        "email"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("parsebird/dataset-deduplicator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "jsonData": [
        {
            "name": "John Doe",
            "email": "john@example.com",
            "company": "Acme",
        },
        {
            "name": "Jane Smith",
            "email": "jane@example.com",
            "company": "Beta",
        },
        {
            "name": "John D.",
            "email": "john@example.com",
            "company": "Acme Inc",
        },
    ],
    "fields": ["email"],
}

# Run the Actor and wait for it to finish
run = client.actor("parsebird/dataset-deduplicator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "jsonData": [
    {
      "name": "John Doe",
      "email": "john@example.com",
      "company": "Acme"
    },
    {
      "name": "Jane Smith",
      "email": "jane@example.com",
      "company": "Beta"
    },
    {
      "name": "John D.",
      "email": "john@example.com",
      "company": "Acme Inc"
    }
  ],
  "fields": [
    "email"
  ]
}' |
apify call parsebird/dataset-deduplicator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parsebird/dataset-deduplicator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Data Deduplicator",
        "description": "Merge and deduplicate Apify datasets by any field combination. Remove duplicate rows while keeping the first or last occurrence. Supports case-insensitive matching and whitespace trimming.",
        "version": "1.3",
        "x-build-id": "7q8cq6Ishg8AqXxva"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parsebird~dataset-deduplicator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parsebird-dataset-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parsebird~dataset-deduplicator/runs": {
            "post": {
                "operationId": "runs-sync-parsebird-dataset-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parsebird~dataset-deduplicator/run-sync": {
            "post": {
                "operationId": "run-sync-parsebird-dataset-deduplicator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "fields"
                ],
                "properties": {
                    "datasetIds": {
                        "title": "Dataset IDs",
                        "type": "array",
                        "description": "One or more Apify dataset IDs to merge and deduplicate. Items from all datasets are combined before deduplication.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "jsonData": {
                        "title": "JSON Data",
                        "type": "array",
                        "description": "Direct JSON array of objects to deduplicate. Use this instead of datasetIds if you want to pass data directly."
                    },
                    "fields": {
                        "title": "Deduplication Fields",
                        "type": "array",
                        "description": "Field names to use as the deduplication key. Rows with identical values across all specified fields are considered duplicates. Use a single field (e.g. 'url') or a combination (e.g. 'firstName' + 'lastName' + 'company'). Matching is case-insensitive and whitespace is trimmed automatically.",
                        "items": {
                            "type": "string"
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
