[OCLUG-devel] Matching Strings

Dan Stromberg strombrg at dcs.nac.uci.edu
Thu Aug 4 15:15:30 PDT 2005


There's a variety of ways of doing this sort of thing, but some of the
easier ones that should still have a good running time, in pseudo code
look like:

A) Not absolutely blazing, but not slow, and very simple coding:
	1) go through file, stuffing all matches in an array
	2) pass the resulting array to qsort with an equivalence
          function that only compares -all- characters in the
          array
	3) go through and print out all matches that aren't
	   adjacent and identical

Or perhaps a little bit better:

	1) Initialize a suitably-sized hash table
	2) Go through the file, stuffing all matches in the hash
	   table.  The table is indexed by the matching prefix,
          but contains an associated piece of data for every
          matching line
	3) Just run through the hash table, outputting all data
          at each key

...or you could do much the same thing with a binary tree or something,
but a suitably-sized hash table will probably outperform the binary
tree, -and- be less likely to exhibit pathological behavior.

-Or-, if you have a truly huge list, then you might use a
database-backed "hash table", rather than keeping everything in memory.

On Thu, 2005-08-04 at 13:57 -0700, James Colannino wrote:
> Hey everyone.  I'm writing a C program that will report either duplicate 
> strings in a file or strings that are the same up to n characters.  For 
> example, let's say that I have the following file named text.txt:
> 
> albert
> albert
> albany
> bert
> biscuit
> true
> trouble
> lone
> 
> If I were to do the following:
> 
> match text.txt
> 
> It would search for complete matches and should output the following:
> 
> albert
> albert
> 
> If I were to do the following:
> 
> match -n 3 text.txt
> 
> It should output the following:
> 
> albert
> albert
> albany
> 
> But instead, it outputs this:
> 
> albert
> albert
> albany
> albert
> albany
> 
> I know why it does this.  I look for matches in the following way:
> 
> read string
> search rest of file below it for match
> read next string
> search rest of file below it for match
> 
> Since albert is there twice, it reads the first instance of albert, 
> outputs all matches below it, reads the second albert, and outputs all 
> the matches below it, and so on.
> 
> My question is, how can I write my program in such a way that these only 
> show up once?  In essence, I'd like it to report:
> 
> albert
> albert
> albany
> 
> Instead of:
> 
> albert
> albert
> albany
> albert
> albany
> 
> Here's a link to the source (it's probably not very good, but I tried to 
> structure and comment it in such a way that it's at least easy to read 
> -- I hope...)
> 
> http://james.colannino.org/match.c
> 
> Any help would be greatly appreciated :)
> 
> James
> _______________________________________________
> OCLUG-devel mailing list -- OCLUG-devel at oclug.org
> http://mailman.oclug.org/mailman/listinfo/oclug-devel
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://localhost.localdomain/pipermail/oclug-devel/attachments/20050804/40b565cf/attachment.bin 


More information about the OCLUG-devel mailing list