SSN-indexed data is commonly seen and stored in many file systems. The trick to speeding up sorting on Spark is to build a numerical key for each record and use the sortByKey operator. In addition, an accumulator provides a shared variable that tasks across the cluster can add to, which is especially useful for counting records.
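The numerical key mentioned above can be built the same way for both SSN formats (nine plain digits, or the dashed AAA-GG-SSSS form), so that the two spellings of the same number compare equal. A minimal sketch of such a key function (the name ssn_key is hypothetical):

#!/usr/bin/env python
# coding=utf-8

def ssn_key(s):
    # Map an SSN string to an integer sort key, or -1 if malformed.
    if s.isdigit() and len(s) == 9:
        return int(s)
    parts = s.split('-')
    # AAA-GG-SSSS: area (3 digits), group (2), serial (4)
    if (len(parts) == 3 and [len(p) for p in parts] == [3, 2, 4]
            and all(p.isdigit() for p in parts)):
        return 1000000 * int(parts[0]) + 10000 * int(parts[1]) + int(parts[2])
    return -1

Note that '123456789' and '123-45-6789' map to the same key, 123456789, which is what lets the two formats be sorted and deduplicated together.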
Single-machine solution
#!/usr/bin/env python
# coding=utf-8

# Validate and deduplicate SSN records; malformed lines go to a side file.
htable = {}
valid_cnt = 0
with open('sample.txt', 'r') as infile, open('sample_bad.txt', 'w') as outfile:
    for l in infile:
        l = l.strip()
        nums = l.split('-')
        key = -1
        if l.isdigit() and len(l) == 9:
            key = int(l)
        # AAA-GG-SSSS form: area (3 digits), group (2), serial (4)
        if (len(nums) == 3 and [len(x) for x in nums] == [3, 2, 4]
                and all(x.isdigit() for x in nums)):
            key = 1000000 * int(nums[0]) + 10000 * int(nums[1]) + int(nums[2])
        if key == -1:
            outfile.write(l + '\n')
        elif key not in htable:
            htable[key] = l
            valid_cnt += 1

with open('sample_sorted.txt', 'w') as outfile:
    for x in sorted(htable):
        outfile.write(htable[x] + '\n')
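As a quick sanity check of the logic above, here is the same pipeline written against an in-memory list of records instead of files (the function name split_records and the sample lines are hypothetical):

def split_records(lines):
    # Returns (sorted unique valid records, malformed records),
    # mirroring the file-based script above.
    htable = {}
    bad = []
    for l in lines:
        l = l.strip()
        nums = l.split('-')
        key = -1
        if l.isdigit() and len(l) == 9:
            key = int(l)
        if (len(nums) == 3 and [len(x) for x in nums] == [3, 2, 4]
                and all(x.isdigit() for x in nums)):
            key = 1000000 * int(nums[0]) + 10000 * int(nums[1]) + int(nums[2])
        if key == -1:
            bad.append(l)
        elif key not in htable:
            htable[key] = l
    return [htable[k] for k in sorted(htable)], bad

Because '123-45-6789' and '123456789' share the key 123456789, feeding both in yields only the first occurrence in the valid output; a line such as 'oops' lands in the bad list.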