Thursday, April 16, 2009

Keep first unique field using python


Input file:
$ cat file.txt
1239941013,A,K
1239941013,T,K
1239941013,Z,T
1239941210,J,L
1239941210,Q,W
1239941519,K,P
1239941013,N,P
1239941013,S,P

Required: Remove the duplicate first fields (keep only first unique first field). i.e. required output:

1239941013,A,K
,T,K
,Z,T
1239941210,J,L
,Q,W
1239941519,K,P
1239941013,N,P
,S,P


Python script for the same:

fp = open("file.txt", "rU")
lines = fp.readlines()
fp.close()

f_f=" "
for line in lines:
f=line.split(",")
if f[0]==f_f:
print ","+",".join(f[1:]).rstrip()
else:
f_f=f[0]
print line.rstrip()

Executing the script:

$ python remove-dup-ff.py
1239941013,A,K
,T,K
,Z,T
1239941210,J,L
,Q,W
1239941519,K,P
1239941013,N,P
,S,P

Related functions and concepts:
1) str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. read more here

2) str.rstrip([chars])
Return a copy of the string with trailing characters removed read more

3) str.join(seq)
Read here

An example on python join used above:

$ python
>>> line="1239941013,A,K"
>>> f=line.split(",")
>>> f
['1239941013', 'A', 'K']
>>> ",".join(f[1:])
'A,K'

0 Comments: