Friday, July 9, 2010

Python - Remove duplicate lines from file


Objective: For every line that appears exactly twice in a file, remove the duplicate (keep only the first occurrence).

Input file:

$ cat file.txt
begin
ip 172.17.4.53
line 172.17.4.52
pl 172.17.4.51
pl 172.17.4.51
new 172.17.4.52
line 172.17.4.52
pl 172.17.4.51
end

Required: Remove duplicate lines from the above file, i.e. print only the first occurrence of each line that appears exactly twice. Lines that appear only once, or more than twice, need no action.

i.e. Required output should look like this:

begin
ip 172.17.4.53
line 172.17.4.52
pl 172.17.4.51
pl 172.17.4.51
new 172.17.4.52
pl 172.17.4.51
end
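
To see why only the repeated "line 172.17.4.52" entry is dropped, it helps to count how often each line occurs. Below is a small sketch using collections.Counter from the standard library. The list sample_lines is just the input file typed in by hand for illustration; it is not part of the script that follows.

```python
from collections import Counter

# The input file from above, hard-coded here for illustration only.
sample_lines = [
    "begin",
    "ip 172.17.4.53",
    "line 172.17.4.52",
    "pl 172.17.4.51",
    "pl 172.17.4.51",
    "new 172.17.4.52",
    "line 172.17.4.52",
    "pl 172.17.4.51",
    "end",
]

# Map each distinct line to the number of times it occurs.
counts = Counter(sample_lines)

# Only lines seen exactly twice are candidates for de-duplication.
twice = [line for line, n in counts.items() if n == 2]
print(twice)

# "pl 172.17.4.51" occurs three times, so every copy of it is kept.
print(counts["pl 172.17.4.51"])
```

So "line 172.17.4.52" is the only line whose second copy gets removed.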

The Python script 'remove-duplicate.py':

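d = {}

# Read every line of the input file into memory.
text_file = open("file.txt", "r")
lines = text_file.readlines()
text_file.close()

# First pass: count how many times each line occurs.
for line in lines:
    if line not in d:
        d[line] = 0
    d[line] = d[line] + 1

# Second pass: write the lines out, skipping the second copy
# of any line that occurred exactly twice.
fp = open("file.txt.nodup", "w")
for line in lines:
    if d[line] == 0:
        # Second copy of a line seen exactly twice: drop it.
        continue
    elif d[line] == 2:
        # First copy of a line seen exactly twice: keep it and mark it as written.
        fp.write(line)
        d[line] = 0
    else:
        # Seen once or more than twice: keep every copy.
        fp.write(line)
fp.close()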

Executing it:

$ python remove-duplicate.py
$ cat file.txt.nodup
begin
ip 172.17.4.53
line 172.17.4.52
pl 172.17.4.51
pl 172.17.4.51
new 172.17.4.52
pl 172.17.4.51
end
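
The same two-pass idea can also be wrapped in a reusable function. This is only a sketch of one alternative, not a replacement for the script above: the function name drop_second_of_pairs is made up for this example, and it uses collections.Counter plus a set instead of resetting the count to 0.

```python
from collections import Counter


def drop_second_of_pairs(lines):
    """Return lines, skipping the second copy of any line that occurs exactly twice."""
    counts = Counter(lines)   # first pass: occurrences per line
    seen = set()              # lines with count 2 that were already emitted
    result = []
    for line in lines:        # second pass: decide line by line
        if counts[line] == 2:
            if line in seen:
                continue      # second copy of a pair: drop it
            seen.add(line)
        result.append(line)
    return result


# In-memory demo with the same content as file.txt (hypothetical data, no file I/O).
demo = [
    "begin\n",
    "ip 172.17.4.53\n",
    "line 172.17.4.52\n",
    "pl 172.17.4.51\n",
    "pl 172.17.4.51\n",
    "new 172.17.4.52\n",
    "line 172.17.4.52\n",
    "pl 172.17.4.51\n",
    "end\n",
]
print("".join(drop_second_of_pairs(demo)), end="")
```

To use it on a real file, pass the result of readlines() to the function and write the returned list out with writelines(). Lines are compared exactly, including the trailing newline, so a last line without a newline will not match an otherwise identical earlier line.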