(Not logged on) | Register | Log On

You can subscribe to this discussion group using an RSS feed reader. techInterview Discussion

Answers to technical interview questions. A part of TechInterview.org

A part of techInterview.org: answers to technical interview questions.

Your host: Michael Pryor

Find duplicates in a large dataset

You have an large dataset (a many-TB file) consisting of elements that each have a 128-bit GUID and a "record locator" (an arbitrary blob less than 1K in size), in no particular order.  You have workspace on disk that's about 150% the size of your dataset. 

Produce an output file containing the following: for each element in the input for which the GUID exists 2 or more times, produce an output element with the GUID followed by all of the appropriate record locators.  That is, collect all the duplicates, and discard all the non-duplicates.  No ordering is required in the output, but performance matters (measured in I/O terms; anything that happens in memory is free).
Skorj Send private email
Wednesday, November 28, 2007
 
 
I should clarify that the workspace mentioned is *in addition to* the space the dataset takes up: you have room for the original data, plus a copy, plus some space left over.
Skorj Send private email
Thursday, November 29, 2007
 
 
If GUIDs fit in ram: load,sort,filter just the GUIDs, make second pass to build the results. 2 * O(n) in terms of I/O.

If collision rate is high(high enough that all singles plus some will fit in memory ), and the total GUID set won't fit in memory load as much as will fit.  load, sort, filter what you have. write keepers to disk. repeat until you have finished list. second pass over input to build results. 4* o(n)
brian
Thursday, November 29, 2007
 
 
How about this?

Traverse the dataset and hash each GUID into a hash map and then add the blob to the end of the linked list at that location. If the linked at that location is empty then add the GUID to another linked list. Once traversal of dataset is complete then we can traverse the linked list containing the GUID's, and hash each GUID into the hash map, and then print the blob's contained in the linked list at that location. I'm assuming here that hash of GUID's won't collide, which shouldn't be difficult as GUID's are needed to be universally unique.
logan Send private email
Friday, November 30, 2007
 
 
2 answers:
If there's enough memory to store all the GUIDs
 - Build a hash table of the guids -> list of record locator offsets O(n)
 - Iterate through the GUIDs in the hash table, writing out the record locators for any GUID with more than two offsets stored.

If there isn't enough main memory to do that:
  - Merge sort the data
  - Iterate through it, copying any GUIDs that appear 2x or more in a row.
Joe
Friday, November 30, 2007
 
 
If you read the problem statement carefully, you'll note that the GUIDs are at least 1/9th of the multi-TB dataset, so they won't fit in memory.  Let's say specififcally that the GUIDs total 1TB, and you have 2GB of memory available to this problem, plus whatever memeory is needed for OS resources etc.


@Joe
How many read-write passes over the entire dataset would be required for a merge sort?  How would you minimize this? 

You can solve this problem without completely sorting the dataset, but perhaps the merge sort approach could be made just as fast?
Skorj Send private email
Friday, November 30, 2007
 
 
Ahh - you're right, they doesn't have to be totally sorted.

Read N elements into memory (where N is the number that cab be held in memory), sort them, write them back. Do this for every group of N elements. At the end of this step you've written and read every element once.

So now you have InitialSize/N sorted sections of the file. Open each one and iterate through them by moving to the next element for the file(s) with the lowest GUID, writing out duplicates as you go.

So it starts of like a merge sort, but replacesthe merging step with a custom duplicate GUID collecting step.

Saturday, December 01, 2007
 
 
Yup, that works well.

An alternative approach is to do the same optimization on a bin sort (aka bucket sort).  Divide the space of all GUIDs into 500 or so buckets (size of data set / size of memory), read the input file sequentially, and append each record to the appropriate bucket.

Now make a pass reading each bucket into memory, sorting it, and outputting the duplicate-free stream.  Now this approach only works well if the input GUIDs are evenly distributed.  If they aren't (eg, we were using a string instead of a GUID), you need some way to force an even distribution.

The solution to that problem is a "hash bin sort" (aka hash bucket sort), where you just use a hash table with each hash bucket represented by a file.  This approach has the useful property that all duplicates still end up in the same bucket, so the performance is the same for duplicate detection and even distribution is far more likely.

Of course, a (partial) merge sort is simpler and works just as well.
Skorj Send private email
Monday, December 03, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz