What is wallet clustering and why does it matter?

Wallet clustering is the process of grouping addresses that probably belong to the same entity into a single logical 'wallet.' It matters because surface-level on-chain analysis treats every address as independent, but in practice a single fund or trader often operates dozens of addresses across multiple chains. Without clustering, a top-100 holder list overstates distribution, whale-activity counts double-count the same actor, and concentration risk is mis-measured. Proper clustering compresses the apparent address universe into the actual decision-making unit count.

How do you trace funds across multiple wallets on Ethereum?

Five-step framework: (1) start with the seed address and pull its full transaction history from Etherscan or a block explorer; (2) identify direct outbound transfers — every wallet that received funds directly from the seed; (3) classify the receiving wallets as exchanges, contracts, or new individual wallets; (4) for each new individual wallet, repeat the analysis recursively for 1-2 hops; (5) apply heuristics — common deposit address, simultaneous activity windows, identical funding source — to filter the graph down to wallets that probably share the same owner. The output is a probabilistic cluster, not a definitive identification.

What are the main heuristics used in wallet clustering?

Four primary heuristics: (1) common-input: multiple inputs to a single transaction probably share an owner (this is the foundational Bitcoin-era heuristic, weaker on Ethereum because its account-based model has no multi-input UTXO transactions); (2) deposit-address reuse: wallets that all deposit to the same exchange address with the same memo or pattern often share an owner; (3) timing-correlation: wallets that consistently transact within seconds of each other across multiple unrelated tokens are likely controlled by the same automation or operator; (4) gas-funding pattern: wallets that all received their initial gas funding from the same source within a narrow time window are likely a sybil cluster. None of these is individually conclusive; combining them produces a confidence score.

Can wallet clustering be defeated?

Partially. Mixing services (such as Tornado Cash before its sanctions), privacy chains like Monero or Zcash, and cross-chain bridges with sufficient anonymity sets can break direct on-chain links. Cross-exchange routing (depositing to one CEX and withdrawing to a fresh wallet from a different CEX) can also break clustering for the routed portion of funds. However, comprehensive defeat is hard because most operational use of crypto requires interacting with public infrastructure (DEX routers, CEX deposit addresses, identifiable contract patterns) that creates new clustering signals. The honest read is that determined privacy-seekers can break individual links, but consistent privacy across years of activity is expensive to maintain and rare.

What are the limits of on-chain forensics?

Three structural limits. First, the data is on-chain only: any analysis stops at the boundary of CEX deposit addresses, custodial vaults, OTC desks, and bridge contracts, where public visibility ends. Second, identity is inferred, never confirmed: clustering produces probabilistic groups, not identities. Third, the graph grows fast: a wallet 3-4 hops away from a seed often has thousands of indirect connections, most of which are noise. Effective forensics requires good filters as much as good heuristics, and the result is always a confidence-weighted hypothesis rather than a definitive map.

On-Chain Forensics & Wallet Clustering: Trace Funds Across Wallets [2026 Guide]

How to trace crypto fund flows across multiple wallets and cluster related addresses in about 45 minutes using free public block-explorer data. Covers the four primary clustering heuristics (common deposit address, timing correlation, gas-funding pattern, behavioral signature) and the 5-step methodology Deep Blue Alpha (DBA) uses internally.

5-step methodology · 4 core heuristics · ~45 min per investigation · Free, public data only

Published 2026-05-05 · Deep Blue Alpha

Not Financial Advice. This article is on-chain methodology for tracing crypto fund flows and clustering wallets — not a trading recommendation, not a privacy-defeat tool, and not legal advice. Wallet clustering produces probabilistic groupings, never definitive identities. Always do your own independent research before drawing conclusions about any specific wallet or actor.

Quick Answer · TL;DR

On-chain forensics is the practice of tracing crypto fund flows across multiple wallets and clustering related addresses to identify the underlying decision-making units. Most surface-level on-chain analysis treats every address as independent — in reality, a single fund or trader often operates dozens of addresses, which makes naive holder counts, whale-activity tallies, and concentration ratios systematically misleading.

The 5-step methodology: (1) pull the seed address transaction history; (2) identify and classify direct outbound transfers; (3) apply the four primary clustering heuristics (common deposit address, timing correlation, gas-funding pattern, behavioral signature); (4) recurse 1-2 hops outward; (5) validate against external labels and write up the cluster with confidence scores.

Free public block-explorer data is sufficient for most investigations. Roughly 45 minutes per investigation depending on cluster size. Output is always a probabilistic hypothesis, never a definitive identity.

The most common mistake in on-chain analysis is treating every address as a separate actor. A naive top-100 holder list of an ERC-20 token might show 100 distinct wallets, but if 30 of them are operated by the same fund using a fragmented address strategy, the actual top decision-making units are closer to 70 — and the structural concentration is materially higher than the headline. Without wallet clustering, on-chain research systematically over-counts distribution, under-counts whale concentration, and double-counts the same actor's activity across multiple addresses.
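The collapse from addresses to decision-making units is simple arithmetic once a cluster mapping exists. The sketch below is illustrative only: `count_decision_units`, the synthetic holder list, and the `fund_A` label are hypothetical names, and it assumes a clustering step has already assigned addresses to owners.

```python
def count_decision_units(addresses, cluster_of):
    """Count distinct actors: clustered addresses collapse to one unit each."""
    units = set()
    for address in addresses:
        # An address with no cluster label counts as its own single-address actor.
        units.add(cluster_of.get(address, address))
    return len(units)


# Hypothetical top-100 holder list in which one fund runs 30 of the addresses.
holders = [f"0x{i:040x}" for i in range(100)]
cluster_of = {address: "fund_A" for address in holders[:30]}

print(count_decision_units(holders, cluster_of))  # prints 71
```

With 30 of 100 addresses mapped to one fund, the exact count is 71 units (70 independent holders plus the fund), which is the "closer to 70" figure in the example.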

This is the methodology Deep Blue Alpha uses internally to resolve address-level ambiguity into actor-level groupings. The full process takes about 45 minutes per investigation using free, publicly verifiable block-explorer data. We will walk through the framework, the four primary heuristics that drive the clustering decision, the structural limits of what on-chain forensics can and cannot resolve, and the honest probability framing that makes the output usable rather than misleading.

What is on-chain forensics?

On-chain forensics is the practice of tracing crypto fund flows across multiple wallets, identifying which addresses likely belong to the same entity, and reconstructing the sequence of transactions that connect them. It uses public blockchain data only — no off-chain identification — and combines transaction-graph analysis, deposit-pattern matching, behavioral clustering, and timing correlation to map relationships between addresses.

The discipline grew out of Bitcoin's UTXO model, where the foundational "common-input" heuristic produced relatively reliable address clustering: if multiple inputs were spent in a single transaction, they almost certainly shared an owner. Ethereum's account-based model breaks that heuristic, but adds new ones — gas-funding patterns, transaction sequencing, and contract interaction signatures — that compensate. The result is that modern on-chain forensics is multi-heuristic rather than relying on any single signal.

The output of forensics is fundamentally probabilistic. A cluster is not a definitive identification; it is a confidence-weighted hypothesis that several addresses share a common owner or operator. The honest framing is "probably, with confidence X" rather than "is" — and analysis that ignores this distinction is the most common source of false attribution in the space.

Why wallet clustering matters for whale tracking

Three concrete consequences of failing to cluster:

Holder distributions are over-stated. A token's top-100 holder list might show 100 distinct addresses but represent only 60-80 actual decision-making units. Naive concentration metrics treat 100 addresses as 100 actors; clustering reveals that the structural concentration is materially higher than the headline. We covered this directly in the whale concentration risk methodology — the active-tradable concentration metric depends on accurate wallet classification, which depends on clustering.

Whale activity counts double-count. A whale-tracking system that fires alerts on every distinct address treats one fund using ten addresses as ten whales. Aggregate whale-flow numbers inflate; multi-wallet convergence signals fire spuriously when the same actor's activity gets counted multiple times. Proper clustering compresses the address universe to the actor universe.

Smart-money cohort analysis loses signal. Cohorts of "smart money" wallets often look broadly distributed when they're actually concentrated in a small number of multi-address operators. Without clustering, a cohort tracker might report "127 smart-money wallets bought $TICKER" when the reality is "9 distinct operators, controlling 127 addresses, bought $TICKER." Different signal, different interpretation.

The clustering principle: the right unit of analysis is the decision-maker, not the address. An address is a credential; a decision-maker is an actor. Address-level metrics over-count actor-level activity in proportion to how aggressively each actor fragments their on-chain footprint.
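As a rough illustration of moving from address-level to actor-level metrics, the sketch below sums per-address net flows into per-actor totals. The function name, the addresses, and the amounts are all made up for the example; it assumes a cluster mapping is already available.

```python
from collections import defaultdict


def actor_level_flows(address_flows, cluster_of):
    """Sum per-address net flows into per-actor net flows."""
    totals = defaultdict(int)
    for address, amount in address_flows.items():
        # Addresses without a cluster label stay as single-address actors.
        totals[cluster_of.get(address, address)] += amount
    return dict(totals)


# Hypothetical data: one fund buying through ten addresses, plus one independent whale.
fund_addresses = [f"0xfund{i:02d}" for i in range(10)]
address_flows = {address: 50_000 for address in fund_addresses}
address_flows["0xwhale"] = 200_000
cluster_of = {address: "fund_A" for address in fund_addresses}

actor_flows = actor_level_flows(address_flows, cluster_of)
print(len(address_flows), "addresses ->", len(actor_flows), "actors")
```

An alerting system keyed on `actor_flows` would fire twice here instead of eleven times, which is the double-counting problem described above.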

The four primary clustering heuristics

No single signal is conclusive. Effective clustering combines multiple heuristics into a confidence score. Below are the four primary heuristics on Ethereum-style account-based chains, ranked by reliability.

Heuristic 1 — Common deposit address Strong

Two wallets that consistently deposit funds to the same exchange address, particularly when the deposits use the same memo, similar amounts, or near-identical timing, are almost certainly controlled by the same entity. Exchange deposit addresses are typically unique per user, so multiple wallets routing to the same address is a high-confidence linkage signal. Reliability is highest when the deposits span months and the similarities are too consistent to be coincidental.
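A minimal sketch of this heuristic groups sender wallets by the deposit address they repeatedly use. The `min_deposits` threshold is an assumption standing in for "consistently"; a fuller version would also check that the deposits span months, as the text requires. All names and addresses are hypothetical.

```python
from collections import defaultdict


def shared_deposit_groups(deposits, min_deposits=3):
    """Return deposit addresses used repeatedly by two or more sender wallets.

    deposits: iterable of (sender_wallet, deposit_address) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sender, deposit_address in deposits:
        counts[deposit_address][sender] += 1

    groups = {}
    for deposit_address, senders in counts.items():
        # Keep only senders that used this deposit address repeatedly.
        regular = {s for s, n in senders.items() if n >= min_deposits}
        if len(regular) >= 2:
            groups[deposit_address] = regular
    return groups


deposits = (
    [("0xA", "0xDEP1")] * 3
    + [("0xB", "0xDEP1")] * 4
    + [("0xC", "0xDEP1")] * 1   # a single deposit is too weak to count
    + [("0xD", "0xDEP2")] * 5   # only one sender, so no linkage
)
groups = shared_deposit_groups(deposits)
print(groups)
```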

Heuristic 2 — Timing correlation Medium-Strong

Two wallets that consistently transact within seconds of each other, across multiple unrelated tokens, over a sustained period, are likely controlled by the same operator (or by the same automation script). Single coincidences are weak signal; sustained patterns across 20+ events are strong signal. The reliability depends heavily on volume — high-frequency wallets often share timing windows with unrelated wallets by chance, requiring tighter correlation thresholds.
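One way to quantify this is to count how many of one wallet's transactions have a counterpart from the other wallet within a short window. The sketch below assumes Unix timestamps in seconds; the 10-second window is an illustrative choice, while the 20-event bar comes from the paragraph above. It does not check the "unrelated tokens" condition, which would need per-event token data.

```python
from bisect import bisect_left

STRONG_EVENT_COUNT = 20  # "20+ events" threshold taken from the text


def near_simultaneous_count(times_a, times_b, window_s=10):
    """Count events in times_a that have a times_b event within window_s seconds."""
    sorted_b = sorted(times_b)
    matches = 0
    for t in times_a:
        # First event in B at or after the start of the window around t.
        i = bisect_left(sorted_b, t - window_s)
        if i < len(sorted_b) and sorted_b[i] <= t + window_s:
            matches += 1
    return matches


# Hypothetical timestamps for two wallets.
times_a = [1000, 2000, 3000, 4000]
times_b = [1003, 2500, 2995, 9000]

count = near_simultaneous_count(times_a, times_b)
print(count, "matches; strong signal:", count >= STRONG_EVENT_COUNT)
```

For high-frequency wallets, shrinking `window_s` is the simplest way to apply the tighter correlation threshold mentioned above.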

Heuristic 3 — Gas-funding pattern Medium-Strong

Wallets that received their initial gas funding from the same source within a narrow time window are likely a coordinated cluster. This heuristic is especially powerful for sybil detection — when 50 wallets all received their first 0.01 ETH from the same wallet within a 30-minute window, the gas-funding signature is a near-conclusive indicator of common operation. Standalone wallets receiving gas from common services (CEX withdrawals, faucets) are weaker signal; wallets receiving gas from a single private address are stronger.
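The sketch below shows one possible way to detect this pattern: group wallets by who sent their first funding, then look for the largest batch funded inside a single time window. The 30-minute default mirrors the example above; `min_wallets`, `ignore_funders`, and all addresses are assumptions made for illustration.

```python
from collections import defaultdict


def gas_funding_clusters(first_funding, window_s=1800, min_wallets=3,
                         ignore_funders=frozenset()):
    """Find wallets first funded by the same source within one time window.

    first_funding: iterable of (wallet, funder, unix_timestamp) for each
    wallet's first inbound transfer.
    """
    by_funder = defaultdict(list)
    for wallet, funder, ts in first_funding:
        # Skip common services (CEX hot wallets, faucets): weak signal.
        if funder not in ignore_funders:
            by_funder[funder].append((ts, wallet))

    clusters = {}
    for funder, events in by_funder.items():
        events.sort()
        best, start = [], 0
        # Sliding window over time-sorted events to find the densest batch.
        for end in range(len(events)):
            while events[end][0] - events[start][0] > window_s:
                start += 1
            if end - start + 1 > len(best):
                best = events[start:end + 1]
        if len(best) >= min_wallets:
            clusters[funder] = [wallet for _, wallet in best]
    return clusters


first_funding = [
    ("w1", "0xF", 0), ("w2", "0xF", 600), ("w3", "0xF", 1200),
    ("w4", "0xF", 90_000),                      # same funder, a day later
    ("w5", "0xCEX", 100), ("w6", "0xCEX", 200), ("w7", "0xCEX", 300),
]
clusters = gas_funding_clusters(first_funding, ignore_funders={"0xCEX"})
print(clusters)
```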

Heuristic 4 — Behavioral signature Medium

Two wallets that consistently exhibit the same trading patterns (same DEX router, same time-of-day activity, same token rotation logic, same gas-bidding strategy, same MEV-resistance choices) are likely controlled by the same operator. Behavioral signatures are the weakest of the four primary heuristics because behavior can converge by chance among independent traders following the same playbook. They become strong when combined with one of the other three heuristics, particularly heuristic 1 (common deposit address).
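A simple way to compare behavioral signatures is set overlap. The sketch below uses Jaccard similarity over hand-labeled behavior features; the feature strings are invented for the example, and Jaccard is one reasonable choice among several, not a method the text prescribes.

```python
def jaccard(features_a, features_b):
    """Jaccard similarity: shared features divided by all features seen."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical behavior features extracted from two wallets' histories.
wallet_1 = {"router:uniswap_v3", "active:00-04utc", "gas:high_priority", "rpc:private"}
wallet_2 = {"router:uniswap_v3", "active:00-04utc", "gas:high_priority", "rpc:public"}

similarity = jaccard(wallet_1, wallet_2)
print(round(similarity, 2))  # prints 0.6
```

Because popular playbooks produce high overlap between unrelated traders, a score like this is best treated as a tie-breaker alongside the stronger heuristics.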

Beyond these four, several secondary heuristics fill in edge cases: ENS domain reuse (two wallets resolving from the same ENS subdomain structure), contract-interaction patterns (wallets that interact with the same private smart contracts), and L2-bridging patterns (wallets that bridge to the same L2 destinations using the same routes). These are useful supplements but rarely conclusive on their own.

Heuristic reliability comparison

Heuristic	Strength	Best use case	Main failure mode
Common deposit address	Strong	Sustained CEX deposits with shared memo or amount pattern	Shared exchange deposit infra (rare)
Timing correlation	Medium-Strong	20+ near-simultaneous events across unrelated tokens	High-frequency wallets share windows by chance
Gas-funding pattern	Medium-Strong	Sybil detection on airdrop hunters and bot clusters	Common-source gas (CEX withdrawals, faucets)
Behavioral signature	Medium	Distinguishing fund operators from retail traders	Independent traders converging on same playbook
ENS / domain reuse	Medium	Identified-individual cluster expansion	Sharing or transfer of ENS ownership
Contract-interaction patterns	Weak alone	Strengthening other signals	Public contracts have many users

The 5-step methodology to trace funds across multiple wallets

Below is the full process. The structured version is also available as HowTo schema on this page.

Step 1 — Pull the seed address transaction history

Start with the address you want to investigate. Open it on Etherscan (or the relevant chain's block explorer — Arbiscan for Arbitrum, BaseScan for Base, BscScan for BSC, Polygonscan for Polygon) and pull the full transaction history. Record the funding source (the first inbound transaction, which usually identifies whether the wallet was funded from a CEX, a bridge, or another EOA), the most active counterparties (wallets it transacted with most frequently), and any contract addresses interacted with. This is the root of the investigation. Allocate ~5-10 minutes for this step on an active wallet.
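Once the history is exported, the three items to record can be extracted mechanically. The sketch below assumes transaction records shaped like a block-explorer export, with string fields `timeStamp`, `from`, and `to` (modeled on Etherscan's account transaction list; verify the field names against your own export). `summarize_seed` and the sample records are hypothetical.

```python
from collections import Counter


def summarize_seed(seed, txs, top_n=3):
    """Return (funding_source, most_active_counterparties) for a seed address."""
    seed = seed.lower()
    ordered = sorted(txs, key=lambda tx: int(tx["timeStamp"]))

    # Funding source: sender of the earliest inbound transaction.
    inbound = [tx for tx in ordered if tx["to"].lower() == seed]
    funding_source = inbound[0]["from"].lower() if inbound else None

    counterparties = Counter()
    for tx in ordered:
        sender, recipient = tx["from"].lower(), tx["to"].lower()
        other = recipient if sender == seed else sender
        if other:  # contract-creation records may carry an empty "to"
            counterparties[other] += 1
    return funding_source, counterparties.most_common(top_n)


# Hypothetical export for a seed wallet.
txs = [
    {"timeStamp": "100", "from": "0xCEX", "to": "0xSEED"},
    {"timeStamp": "200", "from": "0xSEED", "to": "0xROUTER"},
    {"timeStamp": "300", "from": "0xSEED", "to": "0xROUTER"},
    {"timeStamp": "400", "from": "0xSEED", "to": "0xFRESH"},
]
funding_source, top = summarize_seed("0xSEED", txs)
print(funding_source, top)
```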

Step 2 — Identify and classify direct outbound transfers

List every address that received funds directly from the seed. For each, classify the wallet type into one of these categories:

Known CEX: Etherscan-tagged exchange deposit addresses (Binance, Coinbase, Kraken, Bybit, OKX, Crypto.com, etc.). These are dead ends for direct clustering because the deposit goes into pooled custody, not back to a specific person.
DAO treasury or vesting contract: structured allocations, not personal wallets. Dead end.
Bridge contract: the funds moved cross-chain. The investigation continues on the destination chain if relevant.
Named smart contract: DEX router (Uniswap, 1inch, CoW Swap), lending pool (Aave, Compound), or staking contract. The seed used the contract, which does not link it to a specific cluster member, but the contract interaction may reveal behavioral signatures.
Fresh EOA: another externally owned address, possibly belonging to the same actor. This is the cluster candidate.

Etherscan tags many of these directly. For unlabeled addresses, the transaction history reveals the classification: a wallet that only ever transacts on a fixed schedule with the same counterparty is probably a vesting contract; a wallet whose interactions are dominated by router calls is an active trader; a wallet that received a large incoming transfer and never moved it is probably a long-term holding wallet.
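The classification step can be sketched as a lookup against known labels with a fallback. In the example below, `labels` stands in for explorer tags you have collected and `contract_addresses` for addresses known to hold contract code; both are assumed inputs. `UNLABELED_CONTRACT` is an extra bucket added here to flag addresses that need the manual review described above.

```python
from enum import Enum


class WalletType(Enum):
    KNOWN_CEX = "known_cex"
    TREASURY_OR_VESTING = "treasury_or_vesting"
    BRIDGE = "bridge"
    NAMED_CONTRACT = "named_contract"
    FRESH_EOA = "fresh_eoa"
    UNLABELED_CONTRACT = "unlabeled_contract"  # not in the text: needs manual review


def classify_recipient(address, labels, contract_addresses):
    """Classify a recipient using collected labels, else contract vs. EOA."""
    address = address.lower()
    if address in labels:
        return labels[address]
    if address in contract_addresses:
        return WalletType.UNLABELED_CONTRACT
    return WalletType.FRESH_EOA


# Hypothetical labels and recipients.
labels = {
    "0xbinance": WalletType.KNOWN_CEX,
    "0xrouter": WalletType.NAMED_CONTRACT,
    "0xbridge": WalletType.BRIDGE,
}
contract_addresses = {"0xrouter", "0xbridge", "0xmystery"}
recipients = ["0xBinance", "0xRouter", "0xBridge", "0xMystery", "0xNew1", "0xNew2"]

candidates = [
    r for r in recipients
    if classify_recipient(r, labels, contract_addresses) is WalletType.FRESH_EOA
]
print(candidates)  # only the fresh EOAs move on to heuristic scoring
```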

Step 3 — Apply the four primary clustering heuristics

For each candidate EOA from step 2, score it against the four heuristics:

Common deposit address (0-3 points): 0 if no overlap; 3 if both wallets deposit to the same CEX address with consistent pattern over months
Timing correlation (0-3 points): 0 if no correlation; 3 if 20+ events within tight time windows across unrelated tokens
Gas-funding pattern (0-3 points): 0 if independent funding; 3 if both received initial gas from the same private source within minutes
Behavioral signature (0-3 points): 0 if patterns diverge; 3 if patterns are distinctively shared (same DEX, same time-of-day, same token rotation logic)

Total score ranges from 0 to 12. A candidate that scores 8+ is a high-confidence cluster member; 5-7 is medium-confidence (worth continued investigation); below 5 is low-confidence (probably independent). The thresholds are calibrated for typical cases; high-frequency or anonymity-conscious wallets may need adjusted thresholds.
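The scoring rubric translates directly into code. The sketch below encodes the 0-3 range per heuristic and the 8+ and 5-7 tiers from the text; the class and function names are hypothetical, and how each 0-3 sub-score is assigned is left to the analyst.

```python
from dataclasses import dataclass


@dataclass
class HeuristicScores:
    deposit: int      # common deposit address, 0-3
    timing: int       # timing correlation, 0-3
    gas_funding: int  # gas-funding pattern, 0-3
    behavior: int     # behavioral signature, 0-3

    def total(self):
        parts = (self.deposit, self.timing, self.gas_funding, self.behavior)
        for points in parts:
            if not 0 <= points <= 3:
                raise ValueError("each heuristic is scored from 0 to 3")
        return sum(parts)


def confidence_tier(total):
    """Map a 0-12 total onto the three tiers used in the text."""
    if total >= 8:
        return "high"
    if total >= 5:
        return "medium"
    return "low"


candidate = HeuristicScores(deposit=3, timing=2, gas_funding=3, behavior=1)
print(candidate.total(), confidence_tier(candidate.total()))  # prints: 9 high
```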

Step 4 — Recurse 1-2 hops outward to expand the cluster

For each high-confidence cluster member from step 3, repeat steps 1-3: pull its transaction history, identify direct outbound transfers, classify recipients, and score new candidates. Stop at 2 hops from the seed unless a specific lead justifies going further.

The reason to limit to 2 hops: beyond 2 hops, the noise grows faster than the signal. A wallet 3 hops from the seed often has thousands of indirect connections, most of which are coincidental rather than meaningful. The exception is when a specific 3-hop path produces extraordinary signal — for example, a chain of common-deposit-address links across 3 hops — in which case the path is worth following. But "investigate everything 3 hops out" is rarely productive.

Maintain a confidence score for every cluster member. The cluster's core (high-confidence, 8+ score) and periphery (medium-confidence, 5-7 score) should be tracked separately. Periphery members are useful for hypothesis generation but should not be treated as confirmed cluster members.
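The recursion with a hop limit is a bounded breadth-first search. In the sketch below, `candidates_of` and `score_of` are placeholder callables standing in for steps 1-2 and step 3; only core members are expanded further, and periphery members are recorded but not followed. One simplification to note: a candidate is scored once, from the first cluster member that reaches it.

```python
from collections import deque


def expand_cluster(seed, candidates_of, score_of, max_hops=2,
                   core_min=8, periphery_min=5):
    """Expand a cluster outward from seed, tracking hop distance per member."""
    core, periphery = {seed: 0}, {}
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        wallet, hops = queue.popleft()
        if hops == max_hops:
            continue  # stop at the hop limit
        for candidate in candidates_of(wallet):
            if candidate in seen:
                continue
            seen.add(candidate)
            score = score_of(wallet, candidate)
            if score >= core_min:
                core[candidate] = hops + 1
                queue.append((candidate, hops + 1))  # only core members recurse
            elif score >= periphery_min:
                periphery[candidate] = hops + 1
    return core, periphery


# Hypothetical transfer graph and pairwise scores.
graph = {"seed": ["a", "b"], "a": ["c"], "c": ["d"]}
scores = {("seed", "a"): 9, ("seed", "b"): 6, ("a", "c"): 8, ("c", "d"): 12}

core, periphery = expand_cluster(
    "seed",
    candidates_of=lambda wallet: graph.get(wallet, []),
    score_of=lambda parent, child: scores[(parent, child)],
)
print(core, periphery)
```

Wallet `d` is never scored here because it sits three hops out; following it would be a deliberate exception of the kind described above, made by raising `max_hops` for that path.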

Step 5 — Validate against external data and write up the cluster

Cross-reference the cluster against known wallet labels:

Etherscan tags — do any cluster members have public labels (fund names, exchange tags, identified individuals)?
Nansen and Arkham labels — if accessible, third-party label databases. Treat these as inputs, not ground truth; they have their own clustering methodologies and known false positives.
Lookonchain investigations — published curated investigations often cluster wallets to named entities.
ENS domains — do cluster members resolve from ENS domains that suggest a common owner?
Twitter / public disclosures — many fund managers and individuals have publicly disclosed at least one of their wallets, which propagates the label probabilistically across the cluster.

Write up the cluster with: core members (high-confidence, score 8+), peripheral members (medium-confidence, 5-7), excluded candidates (analyzed but rejected, with reason), and the heuristic chain that produced the conclusion. The write-up is what makes the analysis reproducible — without it, the work is impossible for anyone else to verify or correct.
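One possible shape for a reproducible write-up is a plain dictionary that serializes to JSON. The field names below are illustrative rather than a fixed schema, and the sample addresses and evidence strings are invented.

```python
import json


def build_report(seed, scored, excluded, heuristic_chain):
    """Split scored candidates into tiers and keep the evidence trail."""
    report = {
        "seed": seed,
        "core": {},          # score 8+
        "peripheral": {},    # score 5-7
        "excluded": dict(excluded),  # address -> reason for rejection
        "heuristic_chain": list(heuristic_chain),
    }
    for address, score in scored.items():
        if score >= 8:
            report["core"][address] = score
        elif score >= 5:
            report["peripheral"][address] = score
        else:
            report["excluded"].setdefault(address, f"low score ({score})")
    return report


report = build_report(
    seed="0xseed",
    scored={"0xa": 9, "0xb": 6, "0xc": 3},
    excluded={"0xbinance": "known CEX deposit address"},
    heuristic_chain=["0xseed -> 0xa: shared deposit address, same gas funder"],
)
print(json.dumps(report, indent=2))
```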

5-step on-chain forensics methodology summary

Step	Tool	Output	Time
1. Seed address history	Etherscan / chain explorer	Funding source, top counterparties, contract interactions	~5-10 min
2. Outbound classification	Etherscan tags + manual review	Wallet-type breakdown, list of EOA candidates	~10 min
3. Heuristic scoring	Etherscan tx history + DBA wallet behavior	0-12 confidence score per candidate	~10-15 min
4. 1-2 hop recursion	Repeat steps 1-3 on cluster members	Expanded cluster with confidence tiers	~10-15 min
5. External validation + writeup	Etherscan/Nansen/Arkham labels, ENS, Twitter	Final cluster report with confidence breakdown	~5 min

Defeating clustering: what privacy-conscious wallets actually do

Wallet clustering is partially defeasible, and it's worth being explicit about what defeat looks like in practice so the analysis stays honest.

Mixing services — Tornado Cash before its OFAC sanctions and certain decentralized mixers since — break the direct on-chain link between deposit and withdrawal. Funds enter a mixing pool from one wallet and exit from another, and the link is preserved only at the pool level. Comprehensive mixing of all flow can defeat heuristics 1-3 entirely.

Privacy chains — Monero, Zcash shielded transactions, and similar — provide native cryptographic privacy that on-chain forensics cannot penetrate without breaking the underlying cryptography. Funds that enter and exit privacy chains are effectively delinked.

Cross-exchange routing (depositing funds to one CEX, withdrawing them to a fresh wallet from a different CEX, and repeating the process) can break clustering for the routed portion of funds, since CEX internal transfers are off-chain. This is operationally expensive and visible in aggregate (the source wallet shows a CEX deposit; the destination shows a CEX withdrawal), but the link between source and destination is broken at the CEX layer.

Air-gapped wallet operation — using completely disjoint wallets, funded from disjoint sources, operated on disjoint infrastructure, with disjoint timing — defeats all four primary heuristics because there are no shared signals. This is the most expensive defeat and is rare in practice; it requires the actor to maintain multiple unconnected operational stacks indefinitely.

The honest read is that determined privacy-seekers can break individual clustering links, but consistent operational privacy across years of activity is operationally expensive. Most wallets that appear "uncluster-able" are actually one mistake away from being clustered — a single shared CEX deposit, a single shared funding source, a single shared ENS domain. The defeat is rarely complete; it is usually partial and time-limited.

The honest limits: what on-chain forensics cannot tell you

Three structural limits apply to all on-chain forensics work, regardless of methodology quality.

On-chain only. Any analysis stops at the boundary of CEX deposit addresses, custodial vaults, OTC desks, and bridge contracts, where public visibility ends. A whale that primarily operates through CEX accounts and OTC desks has a small on-chain footprint relative to their actual activity, and on-chain forensics will systematically understate their importance. Off-chain data (SEC filings, fund disclosures, exchange transparency reports) fills some of this gap, but those are different disciplines.

Identity is inferred, never confirmed. Wallet clustering produces probabilistic groupings, never definitive identities. Even high-confidence clusters can be wrong — an unusual but legitimate operational pattern (one fund managing multiple sub-strategies through fragmented addresses, or one analyst running multiple independent research wallets) can produce signals that look identical to multi-address sybil operation. The output of forensics is always "wallets X, Y, Z probably share a common operator" — the "probably" is load-bearing.

The graph grows fast. A wallet 3-4 hops from a seed typically has thousands of indirect connections, most of which are coincidental rather than meaningful. Effective forensics requires good filters as much as good heuristics — without aggressive filtering, the cluster expands until everything is connected to everything and the analysis loses signal. The discipline is as much about deciding what to exclude as what to include.
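A back-of-the-envelope calculation shows why. The sketch below assumes every wallet has the same number of distinct counterparties (15 is an arbitrary illustrative figure) and ignores overlap between paths, so it gives an upper bound rather than a measurement.

```python
def max_wallets_within(avg_counterparties, hops):
    """Upper bound on wallets reachable within `hops`, ignoring overlap."""
    return sum(avg_counterparties ** h for h in range(1, hops + 1))


for hops in range(1, 5):
    print(hops, "hops:", max_wallets_within(15, hops))
```

Under that assumption the bound is 240 wallets at 2 hops and 3,615 at 3 hops, which is the jump from a reviewable list to one that needs aggressive filtering.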

Every data point in this methodology is verifiable on a public block explorer. The interpretation is yours. The conclusions you draw should reflect your own evidentiary standards, your tolerance for uncertainty, and the consequences of being wrong about any specific wallet attribution.

Frequently asked questions

How accurate is wallet clustering in practice?

For high-confidence clusters (score 8+ on the four primary heuristics), professional researchers report accuracy in the 85-95 percent range when validated against ground truth (publicly disclosed wallet ownership). Medium-confidence clusters drop to 60-75 percent accuracy. Low-confidence clusters are essentially indistinguishable from chance. The pattern is that clustering is reliable for the most coordinated wallets and unreliable for the most independent ones. That may invert what you would intuitively expect, but it follows from the fact that the most coordinated wallets leave the most clustering signals.

Can on-chain forensics identify a real-world person?

Forensics alone cannot — on-chain data has no name attached to it. What forensics can do is link multiple wallets to a common operator, and if any one of those wallets has a publicly disclosed identity (Twitter post, SEC filing, exchange KYC leak, ENS domain with real name), the identity propagates probabilistically across the cluster. The identification is therefore second-order: forensics builds the cluster, an external disclosure attaches an identity, and the cluster propagates the attribution. Without the external disclosure, the best forensics can do is "this cluster of addresses operates as a single decision-maker."

What's the difference between forensics and blockchain analytics?

Blockchain analytics typically focuses on aggregate metrics — daily volume, active addresses, network fee revenue, protocol TVL — without resolving individual identities. On-chain forensics is the address-level layer beneath analytics: tracing specific funds, clustering specific wallets, and reconstructing specific transaction graphs. Analytics answers "what is the network doing"; forensics answers "who is doing this specific thing." Most professional research combines both — analytics to identify interesting patterns, forensics to identify the actors behind them.

How does forensics interact with whale tracking?

Whale tracking without clustering treats every address as an independent actor and over-counts whale activity. Whale tracking with clustering compresses the address universe to the actor universe and produces more accurate aggregate signals. Deep Blue Alpha's whale wallet leaderboard applies clustering selectively to known multi-address operators — typically funds and identified institutions — while leaving unclassified addresses uncombined to avoid false attribution. The companion smart money tracking guide covers the conviction-scoring layer that sits on top of clustered wallets.

Where can I learn more about specific clustering techniques?

Academic literature on Bitcoin transaction clustering is the foundation; the Meiklejohn et al. 2013 paper "A Fistful of Bitcoins" is the canonical reference for the common-input heuristic. Ethereum-specific clustering research is more fragmented, drawing from the gas analytics community, MEV researchers, and the post-airdrop sybil-detection space. Major industry players (Chainalysis, Elliptic, TRM Labs) publish methodology overviews; their detailed implementations are proprietary. For practitioner-level material, the Lookonchain Twitter feed publishes detailed worked examples of multi-wallet investigations regularly.

Is on-chain forensics legal?

Yes — on-chain data is public, and analyzing it is no different from analyzing any other public dataset. Where forensics intersects legal questions is in how the conclusions are used. Publishing accusations of specific identity based on probabilistic clustering can be defamatory; using forensics to evade sanctions or assist in illegal activity has its own legal exposure. The methodology itself is neutral; the use of the conclusions is what carries legal weight, and that's outside the scope of this guide.

Bottom line

On-chain forensics is the address-level discipline that turns raw blockchain data into actor-level analysis. Without clustering, every address is treated as a separate decision-maker, and aggregate on-chain metrics — holder distributions, whale activity counts, concentration ratios, smart-money cohort sizes — systematically over-state the actor universe. With clustering, the analysis compresses to the actual decision-makers, and the resulting signals are more accurate.

The 5-step methodology — seed address history, outbound classification, heuristic scoring, recursive expansion, external validation — takes about 45 minutes per investigation using free public block-explorer data. The four primary heuristics (common deposit address, timing correlation, gas-funding pattern, behavioral signature) combine into a 0-12 confidence score; high-confidence clusters (8+) are reliable in the 85-95 percent range, lower-confidence clusters degrade rapidly.

The framework is for resolving address-level ambiguity into actor-level groupings. The conclusions you draw from any specific cluster should reflect your own evidentiary standards and the consequences of being wrong about any specific wallet attribution. Every signal is probabilistic; the discipline is in calibrating the confidence honestly.
