• AP (unregistered)

    Frist

  • Some guy (unregistered)

    Is this missing an introductory paragraph?

  • Sazanami (unregistered)

    Yeah, you kind of dropped that comment out of nowhere.

  • chreng (unregistered)

    Would somebody explain the last, beautiful entry in more detail?

  • stardust (unregistered) in reply to chreng

    No.

  • Tim (unregistered)

    What completely baffles me is how someone could create or maintain that regular expression in the last example without it occurring to them that this is a really stupid way to go about things.

  • kktkkr (unregistered)

    What if out there somewhere is the secret to ancient magic and curing cancer and infinite power, or perhaps just rainbow unicorns, and it's just hidden in enterprise code to find a document ID?

  • Brad Wood (google)

    After recently discovering that a valid RegEx pattern can cause an infinite loop (http://stackoverflow.com/questions/1200655/how-to-avoid-infinite-loops-in-the-net-regex-class), I've decided to never use RegEx again. Never.

  • regex explainer (unregistered)

    The last in-house regex-like language seems to be missing a case-insensitive option, so "hello" would become "[H|h][E|e][L|l][L|l][O|o]", which is not very readable. Note that in a real regex language [H|h] would match any of the 3 characters "H", "|" and "h", whereas here you apparently must write the pipe as an "or". Wouldn't [Hh] just work here, too? I guess not, so other characters in those square brackets most likely have special meaning, too.

    Second, the in-house regex language is missing a question-mark operator and a \d class. So they wrote:

    ([C|c][P|p][K,<|k,<][0-9]{11})||([:#.$",'#-/|][C|c][P|p][K,<|k,<][0-9]{11} )||( [C|c][P|p][K,<|k,<][0-9]{11}[ :.$",'#-/|l\])||([:.$",'#-/|][C|c][P|p][K,<|k,<][0-9]{11}[:.$",'#-/|l\])

    But this has the shape y|xy|yx|xyx, with:

    let y=[C|c][P|p][K,<|k,<][0-9]{11}
    let x=[:.$",'#-/|]

    In a real regex language you could just have written (x)?y(x)? or if you have back-references: (x)?y(\1)?

    That means: instead of a 403-character expression, you would have an expression of about 30 characters in a real regex: ([:.$",'#-/|])?cpk\d{11}(\1)?

    The real WTF is their limited in-house regex. But maybe they don't need readability, because the regex is generated code.
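
    To check the observation that the four spelled-out alternatives collapse into one pattern with optional delimiters, here is a minimal Python sketch. The delimiter set and the names DELIM, DOC_ID, long_form, short_form and same_verdict are assumptions made for illustration (the original character class is ambiguous), and re.IGNORECASE stands in for the missing case-insensitive option. It uses a plain optional delimiter on each side instead of the \1 back-reference, since a back-reference would additionally require both delimiters to be the same character.

```python
import re

# Assumed building blocks; neither is taken verbatim from the original system.
DELIM = "[:.$\"',#/|-]"   # one delimiter character (illustrative set)
DOC_ID = r"cpk\d{11}"     # "CPK" followed by exactly 11 digits

# The four alternatives written out: id | delim id | id delim | delim id delim
long_form = re.compile(
    f"{DOC_ID}|{DELIM}{DOC_ID}|{DOC_ID}{DELIM}|{DELIM}{DOC_ID}{DELIM}",
    re.IGNORECASE,
)

# The same set of strings, using an optional delimiter on each side
short_form = re.compile(f"{DELIM}?{DOC_ID}{DELIM}?", re.IGNORECASE)


def same_verdict(s: str) -> bool:
    """True if both patterns agree on whether s is a complete match."""
    return bool(long_form.fullmatch(s)) == bool(short_form.fullmatch(s))
```

    Calling same_verdict on a handful of IDs with and without delimiters is a quick sanity check, not a proof of equivalence.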

  • Erik (unregistered)

    Maybe Kate should have read the documentation for parse_ini_file() before assuming it had the same behavior as using RegEx:

    Note: There are reserved words which must not be used as keys for ini files. These include: null, yes, no, true, false, on, off, none. Values null, off, no and false result in "". Values on, yes and true result in "1". Characters ?{}|&~!()^" must not be used anywhere in the key and have a special meaning in the value.

    Since PHP 5.3 you can use: parse_ini_file($iniFile, true, INI_SCANNER_RAW). Before that it was quite normal to use RegEx to avoid this behavior.

  • RichP (unregistered)

    Remy missed one other way that REs are like a multi-tool: if the job is difficult enough that you should use the "real" knife, screwdriver, or pliers, using the multitool results in a bloody mess of broken parts.

  • (nodebb)

    After recently discovering that a valid RegEx pattern can cause an infinite loop

    Only because of the type of matching engine used. The engines that use stacks (as all the ones that trace a vague heritage from Perl do, most of them via PCRE) have a number of weaknesses, and this is one of those examples. Though I'm not sure that the loop is infinite; it might just be O!M!G! huge. (10^100 for sure isn't infinity, but it takes a long time to count through all the same.)

    Engines that use automata-theoretic approaches don't have this weakness, but can take a lot longer to compile REs and are far harder to debug when anything goes wrong. Nobody really understands finite-state automata at the best of times…

  • (nodebb)

    And holy shitballs, the code that drives this site is bad. Suddenly, proper forum software looks better than I thought previously…

  • Rich Hendricks (unregistered)

    This site has saved my hide more than once. https://regex101.com/

  • Wolf (unregistered)

    Speaking of .ini files... Anyone know of a decent .ini parsing library in C? This would be for a microcontroller reading a micro SDcard.

    Thanks, Wolf

  • Whatever (unregistered) in reply to Brad Wood

    Testing that on regex101: https://regex101.com/r/D50Fr4/1

    PCRE (PHP) gives "Catastrophic backtracking", while JavaScript produces a match.

  • Herby (unregistered)

    On regular expressions:

    Now you have two problems.

    In my experience, if a regular expression is over ONE line, you really do have problems. If it is more than 40 characters, you should start looking at your methodology.

  • Matt Westwood (unregistered)

    Best fun I ever had was writing a regular expression to validate a UK vehicle registration plate.

    Sorry, I've lost the use of the key that inserts irony tags into a comment.

  • David (unregistered) in reply to Brad Wood

    Instead of using RegExs because "a valid RegEx pattern can cause an infinite loop", you're going to use a full programming language, where a valid program can include an infinite loop, or trashing any number of files, or opening a backdoor on the system? Seems that's out of the frying pan and into the fire.

  • Kashim (unregistered) in reply to Herby

    Regular expressions are like threading: it isn't horrible to write, but maintenance gets worse and worse as time goes on. Both are incredibly powerful, sometimes even necessary, but if you don't use them properly, they will crush you in technical debt.

    My soft rule for RegEx: If you have one giant RegEx doing several things, try to split it up into several small, easy to recognize RegEx. If you can write a RegEx to do exactly what you need, but it takes 3 lines, you can probably also write a series of 4 RegEx that are much easier to read, and can be combined to produce the same result.

    For example, that giant RegEx is just one huge friggin OR statement. Instead of one huge RegEx, you could easily split it into 50 smaller ones, one for each format of whatever you are searching for. It is probably less efficient, though not by much, but it also doesn't make the next guy's brain melt searching for that wayward close paren.
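
    A rough Python sketch of that splitting idea, assuming two made-up ID formats: the names ID_PATTERNS and find_ids and both patterns are hypothetical, not taken from the article. Each format gets its own small, named pattern, and the patterns are tried one after another.

```python
import re

# Hypothetical formats: one short, readable pattern per format.
ID_PATTERNS = {
    "cpk": re.compile(r"cpk\d{11}", re.IGNORECASE),
    "inv": re.compile(r"inv-\d{6}", re.IGNORECASE),
}


def find_ids(text):
    """Return (format name, matched text) pairs, grouped by pattern order."""
    return [
        (name, match.group(0))
        for name, pattern in ID_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
```

    Adding a new format then means adding one dictionary entry rather than editing a long alternation, at the cost of scanning the text once per pattern.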

  • Kashim (unregistered)

    Also remember, do not parse html or xml with RegEx.

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

  • Ulysses (unregistered) in reply to David

    This is highly regular, Dave.

  • Sam Kington (unregistered)

    OK, the first two have bugs, but they're a reasonable use of regexen. That span in the first example could easily contain non-numbers as well as the number being incremented (e.g. a button or logo or text label), so it's not insane to fish out the number, increment or decrement it, and then put it back. And sanitising inputs is obviously a task for a regex.

    The problem the larger regexen have is not using the /x switch (or equivalent in other languages). Obviously a regex is borderline illegible if you're not allowed whitespace or any other kind of formatting; but the same is true for any code.
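
    For illustration, this is roughly what the /x idea looks like in Python, where the equivalent switch is re.VERBOSE. The pattern itself and the delimiter set are assumptions, loosely based on the document-ID regex discussed above; DOC_ID is a hypothetical name.

```python
import re

# In verbose mode, whitespace and "#" comments outside character classes
# are ignored, so the pattern can be laid out and annotated like code.
DOC_ID = re.compile(
    r"""
    [:.$"',#/|-]?   # optional leading delimiter (assumed set)
    cpk             # literal prefix, case handled by IGNORECASE
    \d{11}          # exactly eleven digits
    [:.$"',#/|-]?   # optional trailing delimiter
    """,
    re.VERBOSE | re.IGNORECASE,
)
```

    Note that the "#" inside the square brackets is still a literal character; only a "#" outside a character class starts a comment.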

  • Matteo Italia (unregistered) in reply to Wolf

    Write it yourself, tailored to your needs; really, parsing INI files is as trivial as parsing can get, even if you get fancy and handle semicolon and hash comments.

  • Olivier (unregistered) in reply to Rich Hendricks

    I like to use The Regexp Coach for that, it works on Windows, but also in Wine.

  • Hannes (unregistered) in reply to Erik

    Note: There are reserved words which must not be used as keys for ini files. These include: null, yes, no, true, false, on, off, none.

    What about "FILE_NOT_FOUND"?

  • Overpaid Consultant (unregistered) in reply to Brad Wood

    It takes a special kind of person to take the only class of formal languages where pretty much every interesting question is decidable (*) and implement it in a way where infinite loops are possible.

    (*) The only counter example that comes immediately to my mind is 'Is language L regular?' which is an undecidable problem.

  • David Riley (unregistered) in reply to Wolf

    http://ndevilla.free.fr/iniparser/html/index.html

  • Robin888 (unregistered)

    RegExes can even test prime numbers: /^1?$|^(11+?)\1+$/

  • Frank (unregistered)

    "You have a problem and solve it with RegEx? Now you have two problems..."

  • (nodebb) in reply to Brad Wood

    ([^]]*(]")?)+ is a very confusingly written regex though; I think it means the same as ([^]]|]")*, which is much simpler for the regex engine too.
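
    One way to gain some confidence in that claim is a brute-force comparison on short strings. The sketch below is Python, and the names original, simpler and agree_up_to are made up for this example; it is a sanity check over a tiny alphabet, not a proof.

```python
import re
from itertools import product

original = re.compile(r'([^]]*(]")?)+')
simpler = re.compile(r'([^]]|]")*')


def agree_up_to(max_len, alphabet='a]"'):
    """True if both patterns accept exactly the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(original.fullmatch(s)) != bool(simpler.fullmatch(s)):
                return False
    return True
```

    Keep max_len small: the nested quantifiers in the original pattern are exactly the kind that backtrack heavily on longer non-matching input.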

  • (nodebb) in reply to Robin888

    Regexes can even verify Sudoku solutions! In Ruby, it's as short as ^(?!.*(?=(.))(.{9}+|(.(?!.{9}*$))+|(?>.(?!.{3}*$)|(.(?!.{27}*$)){7})+)\1).

  • (nodebb) in reply to Robin888

    RegExes can even test prime numbers: /^1?$|^(11+?)\1+$/

    :wtf: that's extremely clever, but you left out the important part. It's looking for a sequence of one 1, or some subsequence of length > 1 repeated exactly some number of times > 1. In other words, a non-prime number of 1s. First you have to make a string of the digit 1 repeated n times. Then the regexp returns false if n is prime and true if it's not. So the test for whether a number is prime would be like:

    function isPrime(n) { return n % 1 == 0 && !/^1?$|^(11+?)\1+$/.test('1'.repeat(n)); }

    Of course, the number can't be negative, or larger than the maximum string size, or it'll throw a RangeError... also, I'm not sure what the purpose of that ? is in the (11+?) term. It seems like it's not doing anything.
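
    On the ? in (11+?): as far as I can tell it only makes the group lazy, so the engine tries the smallest candidate divisor first, and it should not change which strings match. A small Python sketch to compare both variants (the names LAZY, GREEDY and is_prime are made up for this example):

```python
import re

LAZY = re.compile(r"^1?$|^(11+?)\1+$")    # pattern as posted
GREEDY = re.compile(r"^1?$|^(11+)\1+$")   # same pattern without the "?"


def is_prime(n: int, pattern=LAZY) -> bool:
    """Unary primality test: n is prime if "1" * n does NOT match the pattern."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return pattern.match("1" * n) is None
```

    Running both variants over a small range of n and comparing the results is an easy way to confirm they agree; the running time grows quickly with n, so this is a curiosity and not a practical primality test.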

  • (nodebb) in reply to anotherusername

    (reply, since it's not letting me addendum that)

    It's looking for a sequence of zero or one 1, or some subsequence of length > 1 repeated exactly some number of times > 1.
