I'd like to be able to constrain user input to a white list of valid characters, but I don't want to prevent people from other cultures from signing up. So far, I have this:

^[a-zA-Z0-9èéêëàáâãäçìíîïòóôõöùúûü-_]*$

It allows for most French accents, but the list of accents in the Latin character set is IMMENSE! I would prefer to use a white list instead of a black list, in case I miss something.

Note: this will be for C#, but I'd like to use the same regex for client-side validation to be consistent on both sides. I'm also HTML-encoding the input when I save it to the database.

Is there a more elegant way of making the regex accent-insensitive, while still being restrictive enough to prevent XSS? I don't want to alienate my users.

I would also like to allow some punctuation without opening myself up to XSS attacks. For example, I want someone to be able to enter their company name: if someone worked at Yahoo!, they should be able to sign up.

Asked Apr 14, 2011 at 15:24 by Dave Harding; edited Apr 14, 2011 at 15:41.
  • The ECMAScript RegExp class does not support Unicode beyond the \u.... escape, which matches a single code point: [ECMA-262 Standard][1]. For example, the \w escape only includes the ASCII letters and digits, plus "_". [1]: ecma-international/publications/files/ECMA-ST/ECMA-262.pdf – odrm Commented Apr 14, 2011 at 15:40
  • Am I going about this the wrong way? I guess the broader question is what's the best validation on the server side to prevent XSS (other than simply HTML encoding everything)? – Dave Harding Commented Apr 14, 2011 at 15:46
  • I'm going to split up the server side functions as having one for only alphanumeric and one with punctuation. Thank you for your help! – Dave Harding Commented Apr 14, 2011 at 17:46

6 Answers

Maybe you could use a Unicode range: [\u00C0-\u017E] probably covers all the bases for accents (but you should check a character map to make sure, as I don't know what accents the Italian language has).
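A minimal JavaScript sketch of that range-based white list, assuming the allowed set is ASCII letters, digits, "_", "-", and the block U+00C0 to U+017E. The helper name isAllowedName is made up for illustration. Note that the block also contains × (U+00D7) and ÷ (U+00F7), which are symbols rather than letters, and that it only matches precomposed (NFC) accented characters.

```javascript
// Hypothetical helper: white list of ASCII alphanumerics, "_" and "-",
// plus the code points U+00C0..U+017E (Latin-1 Supplement and Latin Extended-A).
// The hyphen is placed last in the class so it is read as a literal "-".
const ALLOWED = /^[a-zA-Z0-9\u00C0-\u017E_-]*$/;

function isAllowedName(input) {
  return ALLOWED.test(input);
}

console.log(isAllowedName("H\u00E9l\u00E8ne")); // "Hélène" -> true
console.log(isAllowedName("<script>"));         // false
console.log(isAllowedName("Yahoo!"));           // false: "!" is not in this list
```

Because the pattern uses *, the empty string also passes; use + instead if the field is required.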

FWIW: I use a home-brew function that returns a RegExp for all diacritics:

function diacritsRegEx(global, caseInsensitive, multiline){
        var modifiers =   (global          ? 'g' : '')
                        + (multiline       ? 'm' : '')
                        + (caseInsensitive ? 'i' : '');
        // Octal escapes address Latin-1 code points, e.g. \300 is U+00C0 (À).
        // Only lowercase a-z is listed, so pass caseInsensitive for A-Z.
        return new RegExp(
             ['[\\.\\-a-z\\s]|',                          // a-z, . - and whitespace
              '[\\300-\\306\\340-\\346]|',                // all accented A, a
              '[\\310-\\313\\350-\\353]|',                // all accented E, e
              '[\\314-\\317\\354-\\357]|',                // all accented I, i
              '[\\322-\\326\\330\\362-\\366\\370]|',      // all accented O, o (skips × and ÷)
              '[\\331-\\334\\371-\\374]|',                // all accented U, u
              '[\\321\\361]|',                            // Ñ, ñ (two characters, not a range)
              '[\\307\\347]'                              // Ç, ç (two characters, not a range)
             ]
             .join(''), modifiers);
}

^\w+$

Couldn't you just use the word-character class \w? I believe it accepts accented letters (it does in .NET; in JavaScript, \w is ASCII-only).

In some regex implementations a simple \w will cover all those. See http://www.regular-expressions.info/charclass.html
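On the JavaScript side, where \w stays ASCII-only, one possible substitute is a Unicode property escape. The sketch below assumes an engine with ES2018 support (the u flag and \p{...}); .NET also understands \p{L}, \p{M} and \p{Nd}, although the two engines are not guaranteed to agree on every code point. The name isWordLike is hypothetical.

```javascript
// \p{L}  = any letter, \p{M} = combining marks (covers decomposed accents),
// \p{Nd} = decimal digits. Property escapes require the "u" flag (ES2018+).
const WORD_LIKE = /^[\p{L}\p{M}\p{Nd}_-]+$/u;

function isWordLike(input) {
  return WORD_LIKE.test(input);
}

console.log(/^\w+$/.test("caf\u00E9")); // "café" -> false: \w is ASCII-only here
console.log(isWordLike("caf\u00E9"));   // true
console.log(isWordLike("a<b"));         // false
```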

If you want to allow letters (with or without diacritics) and some punctuation, you can use:

^[\w_-]+$

where \w stands for any word character (letters, digits, and the underscore) and _- are the two extra punctuation characters allowed. Don't forget to put the - at the end of the character class if it is used.
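The hyphen goes last because, between two characters, it defines a range instead of a literal "-". A small JavaScript illustration (the sample classes are arbitrary):

```javascript
// "+-_" in the middle of a class is a range from "+" (U+002B) to "_" (U+005F),
// which silently admits characters such as "<" and ">".
const rangeByAccident = /^[a-z+-_]+$/;

// With the hyphen last, it is a literal "-".
const literalHyphen = /^[a-z+_-]+$/;

console.log(rangeByAccident.test("<b>")); // true
console.log(literalHyphen.test("<b>"));   // false
console.log(literalHyphen.test("a-b"));   // true
```

For a white list meant to keep markup out, that accidental range is exactly the kind of hole to watch for.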

For user input in an order form I'm using this: [^\w\s+\/_,.@-]. It matches any character outside the allowed set, which covers the characters needed for emails, zip codes, first names, last names, etc.
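A minimal JavaScript sketch of applying that pattern: the class is negated, so a match means the input contains something that should be rejected. The helper name findDisallowed is hypothetical. In JavaScript, \w is ASCII-only, so accented letters would also be reported here.

```javascript
// Matches any single character outside the allowed set:
// word characters, whitespace, and + / _ , . @ -
const DISALLOWED = /[^\w\s+\/_,.@-]/g;

// Returns the offending characters (an empty array when the input is clean).
function findDisallowed(input) {
  return input.match(DISALLOWED) || [];
}

console.log(findDisallowed("john.doe+tag@example.com")); // []
console.log(findDisallowed("<b>Yahoo!</b>"));            // [ '<', '>', '!', '<', '>' ]
```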
