This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: A useful syntax for regexps and other things



I really hate this whole regular expression thing.  Every program
seems to have different escaping conventions, different extra
features, etc.  Trying to port my URL parsing code from stk to scm to
guile was driving me insane.  Moving regexp code from stk to elisp was
equally painful.  Fixing buggy regular expressions is a nightmare.

Regular expression strings are the worst thing to happen to
programming - they're powerful and completely unreadable.

In scheme one can do much better.  When I was moving my URL parsing
code around I finally wrote a function

   (list->string-regexp lst)

which takes a readable regular expression such as
   '((set "^a-zA-Z_$")
     (group "read" or "readv" or "readln"
	    or "write" or "writev" or "writeln"
	    or "reset" or "extend" or "rewrite"
	    or "close")
     (zero-or-more whitespace)
     "(")

and converts it to
   "[^a-zA-Z_$]\\(read\\|readv\\|readln\\|write\\|writev\\|writeln\\|reset\\|extend\\|rewrite\\|close\\)\\s-*("

(at least this is what my elisp version creates).
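
To see what the generated string actually matches, it can be tried directly with Emacs's built-in string-match. This is only a sketch: the variable name my-io-call-regexp and the sample input are made up for illustration, and \\s- depends on the syntax table of whatever buffer is current.

```elisp
;; The regexp string quoted above, bound to a hypothetical variable name.
(defvar my-io-call-regexp
  "[^a-zA-Z_$]\\(read\\|readv\\|readln\\|write\\|writev\\|writeln\\|reset\\|extend\\|rewrite\\|close\\)\\s-*(")

;; Made-up sample input: a non-identifier character, a keyword,
;; optional whitespace, then an open parenthesis.
(let ((sample " writeln  (x)"))
  (when (string-match my-io-call-regexp sample)
    ;; Group 1 holds the keyword that matched.
    (match-string 1 sample)))
;; => "writeln"
```

Note that "write" is tried before "writeln", but the match only succeeds once the keyword is followed by optional whitespace and "(", so the group ends up holding the full keyword.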

I think the list form is much more readable, hackable from within
code, etc. 
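
As one illustration of "hackable from within code": the keyword alternation need not be typed out by hand. The sketch below builds the (group ... or ...) form from a plain list of strings. The helper name my-regexp-alternatives is hypothetical and not part of the code posted here; it only assumes the infix "or" notation used in the example above.

```elisp
;; Hypothetical helper: interleave the symbol `or' between the strings
;; and wrap the result in a `group' form, in the notation shown above.
(defun my-regexp-alternatives (strings)
  "Return (group S1 or S2 or ...) for the list STRINGS."
  (let ((result nil))
    (dolist (s strings)
      ;; Put `or' before every string except the first.
      (when result
        (push 'or result))
      (push s result))
    (cons 'group (nreverse result))))

(my-regexp-alternatives '("read" "readv" "readln"))
;; => (group "read" or "readv" or "readln")
```

The returned list can be spliced into a larger readable regexp with ordinary list operations before it is converted to a string.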

My only regrets are that a) I didn't follow bigloo's notation (so as
to be compatible with pre-existing work - my notation isn't any better
than bigloo's notation, so there's no reason to invent a new one), 
b) I didn't do it sooner, and c) I didn't do a full and complete job
of it.

If I recall correctly, the author of scsh posted to the scsh mailing
list over a year ago about doing something along these lines.

Here's my elisp version.  I don't have the scheme version handy, and I
wrote this version without trying to do something especially general,
but it basically works.

;; list2regexp - convert a readable regexp to a string regexp.
;; Copyright (c) 1997, Harvey J. Stein, hjstein@bfr.co.il, all rights reserved
;; This code is licensed for use under the GNU LGPL.
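;; A readable regexp is a string, or a list of the elements below.
;; Elements are converted left to right and concatenated:
;;   string                 - Inserted as is; regexp-quoted inside (escape ...).
;;   (group elt ...)        - Match the elements in sequence, as a group.
;;   (set string)           - Match any character in string; a leading ^
;;                            in string negates the set.
;;   (one-or-more elt ...)  - Append the "1 or more times" operator.
;;   (zero-or-more elt ...) - Append the "0 or more times" operator.
;;   (zero-or-one elt ...)  - Append the "0 or 1 time" operator.
;;                            These apply to the last item only; wrap
;;                            several in (group ...) to repeat them all.
;;   (escape elt ...)       - regexp-quote the strings in elt ...
;;   (unescape elt ...)     - Turn quoting back off for elt ...
;; Bare symbols written between elements:
;;   or                     - Alternation: ("read" or "write").
;;   whitespace, any-char, word, not-word, symbol
;;                          - Match one character of that class.
;;   begin, end, word-start, word-end
;;                          - Anchors.
;;   token                  - A whole word of the characters -a-zA-Z0-9_$.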
;;; Set these up for your particular regexp package (the values below are for Emacs)...
(defvar regexp-start-group "\\(")
(defvar regexp-end-group "\\)")
(defvar regexp-start-set "[")
(defvar regexp-end-set "]")
(defvar regexp-one-or-more "+")
(defvar regexp-zero-or-more "*")
(defvar regexp-zero-or-one "?")   ; in Emacs, "\\?" would match a literal "?"
(defvar regexp-or "\\|")
(defvar regexp-begin "^")
(defvar regexp-end "$")
(defvar regexp-any-char ".")

(defvar regexp-word-char "\\w")
(defvar regexp-not-word "\\W")
(defvar regexp-word-start "\\<")
(defvar regexp-word-end "\\>")
(defvar regexp-whitespace "\\s-")
(defvar regexp-open-parenthesis "\\s(")
(defvar regexp-close-parenthesis "\\s)")
(defvar regexp-symbol-char "\\s_")
(defvar regexp-punctuation "\\s.")
(defvar regexp-string-quote "\\s\"")
(defvar regexp-escape "\\s\\")
(defvar regexp-char-quote "\\s/")
(defvar regexp-paired-delimiter "\\s$")
(defvar regexp-expression-prefix "\\s'")
(defvar regexp-comment-starter "\\s<")
(defvar regexp-comment-ender "\\s>")


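(require 'cl-lib)   ; for `cl-case'

(defun list->regexp-string (l &optional quote)
  "Convert the readable regexp L to a regexp string.
If QUOTE is non-nil, strings are passed through `regexp-quote'."
  (cond ((null l) "")
	((and (listp l)
	      (symbolp (car l)))
	 (cl-case (car l)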
	   ((group) (concat regexp-start-group
			    (list->regexp-string (cdr l) quote)
			    regexp-end-group))
	   ((set)   (concat regexp-start-set
			    (list->regexp-string (cdr l) quote)
			    regexp-end-set))
	   ((one-or-more) (concat (list->regexp-string (cdr l) quote)
				  regexp-one-or-more))
	   ((zero-or-more) (concat (list->regexp-string (cdr l) quote)
				   regexp-zero-or-more))
	   ((zero-or-one) (concat (list->regexp-string (cdr l) quote)
				  regexp-zero-or-one))
	   ((begin) (concat regexp-begin
			    (list->regexp-string (cdr l) quote)))
	   ((end) (concat regexp-end
			  (list->regexp-string (cdr l) quote)))
	   ((any-char) (concat regexp-any-char
			       (list->regexp-string (cdr l) quote)))
	   ((whitespace) (concat regexp-whitespace
				 (list->regexp-string (cdr l) quote)))
	   ((symbol) (concat regexp-symbol-char
			     (list->regexp-string (cdr l) quote)))
	   ((word-start) (concat regexp-word-start
			     (list->regexp-string (cdr l) quote)))
	   ((word-end) (concat regexp-word-end
			     (list->regexp-string (cdr l) quote)))
	   ((word) (concat regexp-word-char
			   (list->regexp-string (cdr l) quote)))
	   ((not-word) (concat regexp-not-word
				 (list->regexp-string (cdr l) quote)))
	   ((or) (concat regexp-or
			 (list->regexp-string (cdr l) quote)))
	   ((token)  (list->regexp-string (cons 'word-start
						(cons '(one-or-more (set "-a-zA-Z0-9_$"))
						      (cons 'word-end
							    (cdr l))))
					  quote))
	   ((escape) (list->regexp-string (cdr l) t))
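	   ((unescape) (list->regexp-string (cdr l) nil))
	   ;; Fail loudly instead of silently dropping the rest of the list.
	   (t (error "Unknown regexp keyword: %S" (car l)))))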

	((listp l)
	 (concat (list->regexp-string (car l) quote)
		 (list->regexp-string (cdr l) quote)))
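	;; Here `quote' is the optional argument, not the special form:
	;; when it is non-nil, atoms are escaped with `regexp-quote'.
	(quote
	 (regexp-quote l))
	(t
	 l)))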