I have to simplify custom regex expressions parsed to a certain data type. With "simplify" I mean the following (emphasis mine):
Given the rules:
- lowercase letters match themselves, eg.:
amatchesaand nothing else- parens enclosing only letters match their full sequence, eg.:
(abc)matchesabcand nothing else- square brackets enclosing only letters match every letters inside, eg.:
[abc]matchesaandbandcand nothing elseThe following are all valid:
(a[bc])matchesabandacand nothing else[a(bc)]matchesaandbcand nothing else(a(bc))is the same as(abc)and matchesabcand nothing else[a[bc]]is the same as[abc]and matchesaandbandcand nothing elseRegexes can be simplified. For example
[a[[bb]b[[b]]](c)(d)]is really just the same as[abcd]which matchesa,b,candd.
I have implemented a simple parser combinator in Haskell using attoparsec and the following destination data type:
data Regex
= Symbol Char
| Concat [Regex] -- ()
| Union [Regex] -- []
deriving (Eq)
However, I'm really struggling with the simplification part. I try to reduce the Concats and Unions by a combination of unwrapping them, nubbing and concatMapping to no avail. I think that the data type I have defined might not be the best fit but I have run out of ideas (late at night here). Could you help me look to the right direction? Thanks!
simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs
CodePudding user response:
You are missing a couple simple improvements, for starters. simplify (Concat [x]) = x and likewise for Union: there's no need for the wrapped regex to be specifically a symbol.
Then you need to start looking at Concats containing other Concats, and likewise for Union. Sure, you start by simplifying the elements of the wrapped list, but before jamming the result back into a wrapper, you lift up any elements using the same wrapper. Something like:
simplify (Concat xs) =
case concatMap liftConcats (map simplify xs) of
[x] -> x
xs -> Concat xs
where liftConcats :: Regex -> [Regex]
liftConcats r = _exerciseForTheReader
Then you can do something similar for Union, with a nub thrown in as well.
