Using a mass deduping tool on SFDC records
Once you get to any size of database in SFDC, it will be essential to get some real deduping capability — way beyond what's possible with the internal wizards (described in the section on dealing with individual dupes), which are good for when you only have 10-40 records to dedupe. There are several deduping tools available, and we’re making this text as vendor neutral as possible.
There are two main categories of deduping tool — those that do things on external data sets (Excel, Access, SQL databases) and those that work directly on data while it is held in SFDC. We’ve already covered deduping as part of the import process in the section on preventing duplicate Salesforce records, now our use case is removing dupes that are nicely entrenched in your production SFDC database. By far the fastest and best way to do that is in situ using the tools that work directly against the live data. However,any time you do a lot of data manipulation in a production SFDC system, you may cause triggers, workflows, roll-up fields to be re-evaluated, and interactions with external systems may also be invoked. These can bog down the system in a big way, and may cause unforeseen consequences that are quite ugly and tough to reverse so you’ll need to test thoroughly in your specific system before you do any mass deduping, no matter how good the tools are. Almost always, during the final merge process, you’ll want to quiesce the system to the extent possible and turn off relevant triggers and workflows.
If you need to explain to others what you are doing, you probably don’t want to show them all the details below…but you can show them this “lite overview.”
The basic mass-deduping strategy:
- Sit down. Have some warm milk. Relax. This is going to take some time. Deduping is a process, not an event. And successful deduping of a big, messy database will take weeks. Yes, really. Budget your time accordingly and set expectations with the people who'll need to review the merge candidates: you're going to need an hour a day of their time throughout the process. Yes, really.
- Before you start each day’s deduping, do a complete backup of your SFDC data. Every single record of every single object. Really. Saved on your servers. I know, it can take hours, even with the fastest backup tool. You'll thank me later. You can use SFDC's data export backup feature including all attachments…but iff you've already used this week's backup for other purposes (yes, SFDC allows you to use this feature only once a week), back up the data today using Data Loader, the deduping tool’s “mass backup” feature,or some other tool. Just download everything in the system, including all the history tables and chatter stuff).
- If you aren’t using SFDC Campaigns, you really should consider converting to them prior to deduping Leads and Contacts. The system’s standard Lead Source is a pick-list that can only hold one value. When you merge, if you have only the system’s Lead Sources, you’re almost certain to lose some valuable data. In contrast, Campaigns will preserve all the historical “touch” data for you (and enable a “touch history” that is valuable to marketers). Converting to Campaigns is a nice big spreadsheet exercise: examine all the Lead Sources in the system, create a campaign for every Lead Source currently in use, and then create campaign members for each Lead, Contact, Account, and Opportunity in the system. This can take at least an hour for every 500 Leads or Contacts you've got in the system...so budget time accordingly.
- Prior to deduping, normalize and cleanse your data so the deduping tool can do a better job of matching records. The goal is for the deduping tool to be able to use as many literal-string matches as possible so the fuzzy matching is more effective. The very first cleansing step: make sure all records are “owned” by active users, as records that aren’t have several strange behaviors (the SOQL query to find the zombie records is “SELECT id FROM <object> WHERE User.IsActive = FALSE”). Next, transform state and country strings into ISO-standard state and country codes, or at least move them to standard spelling (e.g., “USA” vs “United States”). Fix country and city codes on phone numbers (often, in European numbers there are extra zeroes that don’t apply when you’re dialing internationally). Make sure phone numbers aren't being represented as scientific notation (yes, really). Make sure postal code spacing is consistent (is it XY 12 AB or XY1 2AB or XY12AB?) and that ZIP codes are represented as text, not numbers (to avoid leading-zero suppression that makes the ZIP code for Amherst, MA look like "1002" instead of "01002"). Remove middle initials from first name fields. Handle first initials (like "J. Walter") and suffixes (like “Jr.”) in a consistent way. Standardize the way you handle compound last names (is it van der Heyden or van derHeyden or vander heyden?). Decide how you're going to handle accented and (non-ISO-Latin) characters in names, company names, street names and city names and normalize all these strings (e.g., so that forté becomes forte).
- Prior to deduping, identify fields that you'll want to preserve both the "winner" and "loser" values that both records might have right...even though they are different values (e.g., multiple phone numbers, email addresses, stage, status, owner, record type, lead score, rating, type). Once you've ID'd these vulnerable / valuable fields, create a new text-area field (it must be a text-area field) in the table you're deduping, and turn history tracking on for that field (you might want to call the new field "▼Extras" for reasons I won't explain) and use Excel concatenate formulas to generate the ▼Extras content. Due to length constraints (255 characters), you may need to create more than one Extras field. Also, create a text-area field for housekeeping information, such as “dead” ID numbers, merge details, and other information you may need in the future (yes, turn on history tracking for it, too). It’s a best practice to have the ▼Extras and ▼Housekeeping content in an XML format, for later parsing and processing, along these lines:
<mergeRationale>bad import from reality TV participants</mergeRationale>;
Once you’ve got the contents right in your spreadsheet, use Data Loader or Excel Enabler to populate each record with the new contents generated from those vulnerable fields. Once these ▼Extras fields are populated, you're almost ready to start...
- No, you're not. Back up ALL YOUR DATA again. Seriously. You will live to see the day when this will save your bacon.
- No, you're not ready (revisited). Back up the entire metadata model, using Eclipse. You want to make sure you can backtrack any changes you might have to make (or that some other creative admin is making) during your deduping process.
- Hopefully, you've got access to a FULL sandbox (if not, try to get someone to buy it for a month or two — it's really the only safe way!). Look in the sandbox login log to see who's been in there since the last refresh. Email each of them and ask if they have anything in there that they need to save. Before you do the refresh, backup all the sandbox data (REALLY) and use Eclipse to do a full backup of all of its metadata, in case somebody forgets to tell you about their crown jewels after you’ve erased them with the sandbox update. So now that you’ve got that Sandbox backup, do a full sandbox refresh (you only get one of these every 30 days, so you'll want to think about calendar effects for future refresh demands in the organization). On big systems, this may take days, so be patient.
- Use the full Sandbox to test each pass of deduping for each object to understand the sideeffects of merges. This is particularly true if they have external AppExchange apps installed. After you've validated the behavior in the Sandbox, you redo the exact same deduping procedure in their production system. NOTE that external apps may behave differently in production than in sandbox, so you'll still need to do some small batches of each dedupe cycle in production so you can spot the sideeffects and gotchas.
- Typically, plan to dedupe the "leaf nodes" first. This means things like Leads, Notes, and Tasks. They are the least dangerous to get wrong, and working on them will re-familiarize you with the deduping tool features and quirks, which are easy to forget.
- Almost always do Lead deduping at a single-level table first (e.g., Leads to Leads, then Leads to Contacts).
- Be very wary of deduping things that are at the top of an information tree (i.e., have lots of child and related records), no matter how tempting it may be to start there. Accounts are the most obvious case: you don't want to dedupe them until the very end due to the catastrophic impact of an erroneous merge. Although the merges are logged, they CANNOT be undone, so you better really mean it when you say "go." If you see possible dupes in the list that ought to have been caught earlier, stop and go back to understand why it wasn't trapped and fix that before you go on. Make sure that earlier de-dupe candidates aren’t being re-instantiated by some code somewhere.
- If you're deduping something that has an external data key (for example, a pointer to a record in an integrated system or an industry standard database like Dunn & Bradstreet) you need to make sure that field is visible in your deduping tool. Generally speaking, DO NOT merge records that have external keys. While they may be legitimate dupes, merging them will wreak havoc on the external pointers. You'll need to do some more serious system work to get rid of these cases, and many companies choose to leave them in place.
- For each dedupe cycle, use a whiteboard to diagram:
- What fields and rules you are going to use to match on
- What fields you need to see when looking at the match report
- What rules and criteria you are going to use to determine the "winner" (master) in the merge transaction.
- In sales organizations, the Owner of a record can be a hugely political factor in determining the winner. Make sure you've bullet-proofed your "ownership" criteria and outcomes...so that you won't have to wear a bullet-proof vest to work.
- Be REALLY careful about how you handle subsets of merge candidates (for example, oldest record wins in the general case, but that's only after a first pass of merging has been done with a primo list of leads you just collected).
- What situations should preempt merges (for example, don't merge a pair if foreign keys are present in both candidates and the external system can't tolerate merges). This one is tricky, as none of the tools handles this automatically. You’re just going to have to sit there and think.
- The order of your matching passes and merging cycles.
- Take a picture of what's on your whiteboard before you erase it for the next dedupe pass, and save that picture in your dedupe log book.
- Set up the dedupe tool to append/combine the ▼Extras fields and any other long-text fields in your table. If your tool doesn’t have this feature, you’ll have to do this as an Excel exercise after the merge.
- Never blindly accept a merge result set; always scan at least three fields for each record to find typos, obsolete email domains, abbreviations and mismatches. Talk with the users and you may find there are a number of additional fields you need to look at (e.g., owner, territory, status) to detect problems.
- Involve a user representative in approval of each merge set before you do the deed! Do a trial run of the merge-candidate logic with a large representative set of records. Divide this set up into batches of no more than 2000 records and distribute these to end users for review (give them no more than 36 hours to get back to you). You'll need them to do a reality check on whether each of the records should be merged and which records are the “winner” in the process. EVERY TIME, create a spreadsheet export of the merge candidates. When you do this, in CRMfusion you'll need to replace column E of the spreadsheet values "28" with "winner" and "29" with nothing (aka: blank, null, nada, bubkes). Get user authorization before EVERY merge — you have no idea how many little details can get in the way.
- Now, before you start doing even your first merge, you probably want to turn off all the validation rules, workflows, and triggers for the object in question. In really big systems, there's all manner of field changes and alert emails that may be fired off by merges. Further, there are all kinds of data conditions that may cause the merge to fail (in particular, due to old data banging into new validation rules). You may also discover that Marketo or other related plug-in systems may have campaigns, scoring systems and alert emails that may fire off with merges...so talk with your local marketing automation maven to get that stuff turned off during your merge cycles.
- Do the deduping on each object in several passes, starting from the most stringent match criteria. As you relax the criteria over several passes, watch the match results to make sure you aren't getting a lot of false matches. You'll be able to select/de-select as many of the match result sets as you like, but if the exceptions are more than 25 percent of the set, you're beginning to waste time.
- The goal is to improve data quality as you remove dupes. Some deduping tools let you correct errors you might spot in the result set. If so, fix these data quality issues on the spot.
- Keep a log book of every deduping cycle you do! Best to take notes about the objective of each pass, the timing, the name of the scenario files, and the name of the log file created. At some point, you'll need to reverse some deduplication and your only hope is to have notes to reconstruct the crime so you can reverse it. Did I mention there's no undo for merges, and no backup is kept EXCEPT to reconstitute the records from the backups that you make yourself?
- With each deduping cycle, there will be a number of merge-tool settings that change. Save each and every one of these scenario and master-rule changes as a separately named file (like "Client-Leads-Pass1"). Make sure the file name is indicative of what problem you were trying to solve, as you'll never remember what you were thinking the next time you run the tool. Seriously, you'll need these!
- At the end of each deduping cycle, do any post-merge “unpacking” of the ▼Extras content you created. Since a given row may actually be merged in more than one pass, you don’t want that content to get scrambled. Typically, it’s a good idea to put all the ▼Extras and ▼Housekeeping content from prior cycles into a single hidden long-text field (yes, make it 32K characters long) called something like ▼MergeLog.
- Merging large dedupe sets takes quite a while due to computation realities and the cycle-time of web services. Budget enough time to really do the job if you've got thousands of dupes. However, do not do all of them at once; take breaks every hour or so to clear your head. If you have a large merge set, it will take a couple of minutes per 100 dupes. Get up, walk around. Seriously, breathing and pacing increases your IQ and prevents silly errors.
- Note that most deduping tools’ fuzzy matching process is very CPU intensive and suffers from “combinatorial explosion” problems. In most cases, batches should be only a few thousand at a time. (I’ve done a batch of more than 20,000 and it took forever.) You’ll need to figure out a data partitioning strategy (typically, something like geographic or alphabetical) and a batch-logging system so you know where you are at all times in your data set. For really big data sets, you can have several instances of the deduping tool running in parallel (on separate machines). Typically, this is a “laptop farm” (you can’t use servers because the UI is an essential part of the deduping tool’s operation). I like using sticky-notes on each keyboard so I don’t lose track of each system’s batch and status.
- When you’re “done for the day” with merge work, don’t forget to turn triggers and workflows back on, and turn the history-tracking for your ▼Extras and ▼Housekeeping off.
- As you go through merge passes, look for patterns of duplicates and identify the situations in which they occur, the kinds of errors or formatting problems, the lead sources and time stamps to help identify the original source of the duplication. After you get familiar with your data set, the duplicate sets will have an identifiable fingerprint that helps isolate which user behavior (a result of a usability or training issue), business process, internal code, external source is causing the creation of dupes. Keep a list of these items to fix when you’re done with the deduping, because, well, you don’t want to have to do this forever.
Expect that deduping is a process, not an event. It's a gradual improvement in data quality, and the best practice is to hit the database at least every couple of weeks to identify new problems. Documentation of your dedupe cycles also facilitates training and business process improvement.
Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.