Deduping in Salesforce: The complete how-to

Deduping in Salesforce: The complete how-to

Don’t be duped by duplicate records. Deduping seems like one of those routine system maintenance issues, but it’s actually very tricky and there's no undo.

Preventing duplicate Salesforce records

Duplicate record prevention is a cute little topic, but it’s part of a much larger data quality topic.  The big picture is embodied in this sentence:  no record should be created without some level of procedural or automated controls.  Those controls should help enforce data quality and fidelity concerns, including naming conventions and record ownership.

There are three major sources of records in a CRM system:  user input, data imports and data integrations.  Each of these needs to be managed differently, but with a coherent strategy.

Let’s start with the simplest case:  user input.  Ideally, every record in every object should be checked to make sure it isn’t a dupe.  This can range from the brutal (duplicate names not allowed, via a data definition language constraint) or subtle (a fuzzy match on several criteria, with a warning message on-screen that allows the user to override the warning as needed.  While users may not mind that much if things are brutal, code (including test code), imports, and integrations will mind a lot if the same brutality is applied without exception.  Our recommendation is generally to move away from brutal measures and focus on warnings and follow-up emails beseeching the user to make sure the apparent dupe record has a business purpose.  There are products in the Salesforce AppExchange that do a good job of this, or if you have really tricky needs you may have to write some code.

Record imports are a bit less cut-and-dry, but actually allow for the nicest outcomes.  Typically, data imports have a number of data quality problems (such as formatting, capitalization and column-matching) in addition to being a source of potential dupes.  For imports, you’ll want to resolve as many of the data quality problems as possible before you start deduping, as more-complete records mean better dupe detection.  And, of course, you’ll want to dedupe within the import set before you dedupe between the import set and existing CRM data.  Since you have time and freedom to test imports before the real import, you can make the cycle squeaky-clean.

Typically, import data sets do not have any kind of universal identifier (such as a D-U-N-S or social-security number), so you’re depending on fuzzy matching.  Each source of external data will have its own optimal matching parameters (such as “how to clean account names of articles and punctuation” or “how country codes and city codes are handled in phone numbers”), and you’ll want to look at each batch of data carefully while setting the fuzzy-match parameters and scoring.  Document each source so you can repeat the recipe next time.  Most of the time, fuzzy matching products provide either  a series of prioritized rules (first look for verbatim name, then fuzzy name, then phone number) or a set of scoring/weighting parameters.  I happen to like scoring-based parameters, but both prioritized rules and scoring work well if you’re careful.  

When it comes to Accounts and other tables, the CRM system is almost invariably the preferred point of import.  In contrast, for Leads and Contacts, many marketing automation systems have their own importers, and the vendors often recommend you use them, rather than the CRM system’s.  That's fine, but be aware that most marketing automation systems use only email addresses as the basis of deduping, and there isn’t much fuzziness or flexibility:  if someone has multiple email addresses, they are often represented as multiple people.  Consult with your marketing automation vendor about their recommended strategy for handling this. 

Perhaps the hardest case for dupe prevention comes from data integrations between the CRM and outside batch or synchronous sources, because each of those sources needs to be considered separately—each requiring its own handler.  Unfortunately, those handlers almost always have to be custom code embodied in the middleware or the source application itself.  Typically, other applications or outside data feeds don’t know how to deal with an error generated by dupe detection/rejection in the CRM system.  If you’re really lucky, the app will provide detailed error logs and have workflow managers that know how to put troublesome transactions and their dependencies into suspension until the underlying issue is resolved.  If you’re somewhat lucky, the record (or the batch) will simply be rejected and thrown into the retry queue.  If you’re not lucky, the record will simply be thrown into an error status and the external app’s errors will pile up indefinitely.  Yum.

Obviously, the best strategy for external apps is for them to query SFDC for likely-dupe records before the external app tries to create the new record.  If an appropriate match is found, the external app should simply do an update to the existing record.  Some applications’ transaction managers will support this…but only some.  Get ready for coding.

Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.

Join the newsletter!


Sign up to gain exclusive access to email subscriptions, event invitations, competitions, giveaways, and much more.

Membership is free, and your security and privacy remain protected. View our privacy policy before signing up.

Error: Please check your email address.

More about ABBradstreetClickCreationDuPontEclipseExcelFord MotorGoogleIDGISOMarketoMicrosoftNBCParallelsSalesforce.comSimpleTestTyco

Show Comments