[OCLUG-devel] Matching Strings
Dan Stromberg
strombrg at dcs.nac.uci.edu
Thu Aug 4 15:15:30 PDT 2005
There's a variety of ways of doing this sort of thing, but some of the
easier ones that should still have a good running time, in pseudo code
look like:
A) Not absolutely blazing, but not slow, and very simple coding:
1) go through file, stuffing all matches in an array
2) pass the resulting array to qsort with an equivalence
function that only compares -all- characters in the
array
3) go through and print out all matches that aren't
adjacent and identical
Or perhaps a little bit better:
1) Initialize a suitably-sized hash table
2) Go through the file, stuffing all matches in the hash
table. The table is indexed by the matching prefix,
but contains an associated piece of data for every
matching line
3) Just run through the hash table, outputting all data
at each key
...or you could do much the same thing with a binary tree or something,
but a suitably-sized hash table will probably outperform the binary
tree, -and- be less likely to exhibit pathological behavior.
-Or-, if you have a truly huge list, then you might use a
database-backed "hash table", rather than keeping everything in memory.
On Thu, 2005-08-04 at 13:57 -0700, James Colannino wrote:
> Hey everyone. I'm writing a C program that will report either duplicate
> strings in a file or strings that are the same up to n characters. For
> example, let's say that I have the following file named text.txt:
>
> albert
> albert
> albany
> bert
> biscuit
> true
> trouble
> lone
>
> If I were to do the following:
>
> match text.txt
>
> It would search for complete matches and should output the following:
>
> albert
> albert
>
> If I were to do the following:
>
> match -n 3 text.txt
>
> It should output the following:
>
> albert
> albert
> albany
>
> But instead, it outputs this:
>
> albert
> albert
> albany
> albert
> albany
>
> I know why it does this. I look for matches in the following way:
>
> read string
> search rest of file below it for match
> read next string
> search rest of file below it for match
>
> Since albert is there twice, it reads the first instance of albert,
> outputs all matches below it, reads the second albert, and outputs all
> the matches below it, and so on.
>
> My question is, how can I write my program in such a way that these only
> show up once? In essence, I'd like it to report:
>
> albert
> albert
> albany
>
> Instead of:
>
> albert
> albert
> albany
> albert
> albany
>
> Here's a link to the source (it's probably not very good, but I tried to
> structure and comment it in such a way that it's at least easy to read
> -- I hope...)
>
> http://james.colannino.org/match.c
>
> Any help would be greatly appreciated :)
>
> James
> _______________________________________________
> OCLUG-devel mailing list -- OCLUG-devel at oclug.org
> http://mailman.oclug.org/mailman/listinfo/oclug-devel
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://localhost.localdomain/pipermail/oclug-devel/attachments/20050804/40b565cf/attachment.bin
More information about the OCLUG-devel
mailing list