Hello, can you please suggest a regular expression that matches
CheesePizza and all its variants in which
one or more of its letters are doubled, tripled, or repeated even more times,
case-insensitively,
and a second expression that does the same, except that between "cheese" and "pizza" there are one or more non-alphabetic characters such as a space or "-"? (It may be more effective if it matches any non-alpha characters, like cheese-!<**pizza, up to around 15 special characters; I do not expect spammers to be more persistent than that.)
Examples that should match:
1)
ccheesepizza
cHeesepiiiiiiiiizzaaaaaa
2)
ccheese pizza
cHeese------piiiiiiiiizzaaaaaa
Or if you have a better idea…
I searched on SE, but my English vocabulary on this subject is not perfect. I also asked ChatGPT, but I am unable to confirm that the regexes it provided work.
This is the Linux bash script that I used to test them, and it never found a match:
# Prompt the user for input
#read -p "Enter a string: " input
input="CheesePizzza"
# Regular expression to match (change this according to your needs)
regex="^(?=.*C)(?=.*h)(?=.*e)(?=.*s)(?=.*e)(?=.*P)(?=.*i)(?=.*z)(?=.*a).*$"
regex="^(?=.*C{1,})(?=.*h{1,})(?=.*e{1,})(?=.*s{1,})(?=.*e{1,})(?=.*P{1,})(?=.*i{1,})(?=.*z{1,})(?=.*a{1,})(?=.*$).*$"
regex="^(?=.*C)(?=.*h)(?=.*e)(?=.*s)(?=.*e)(?=.*P)(?=.*i)(?=.*z)(?=.*a)(?=.*$).*$"
# Use awk to check if the input matches the regular expression
if echo "$input" | awk -v regex="$regex" 'BEGIN{ if (match($0, regex)) exit 0; exit 1 }'; then
echo "Input matches the regular expression."
else
echo "Input does not match the regular expression."
fi
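A note on why the test above can never report a match, followed by a minimal sketch of a harness that can. awk uses POSIX extended regular expressions, which have no lookahead syntax such as (?=...), and the check runs in the BEGIN block, before any input line has been read, so $0 is still empty there. The sketch below assumes a plain ERE written for this illustration (it is not one of the ChatGPT regexes) and lowercases the line before matching:

```shell
#!/bin/bash
input="CheesePizzza"

# POSIX ERE: every letter one or more times, anchored to the whole line.
# The pattern is written in lowercase because the line is lowercased first.
regex='^c+h+e+s+e+p+i+z+a+$'

# The match runs in the main rule (once per input line), not in BEGIN.
# END turns the flag into an exit status: 0 if some line matched, 1 otherwise.
if printf '%s\n' "$input" |
   awk -v regex="$regex" 'tolower($0) ~ regex { found = 1 } END { exit !found }'
then
    result="Input matches the regular expression."
else
    result="Input does not match the regular expression."
fi
echo "$result"
```

With the sample input this prints "Input matches the regular expression."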
I wanted to test the regular expression before I add it to the file that my awk uses to remove the messages:
LC_ALL=C awk … -v repatterns="regexfile-one-regex-per-line" …
I hope that, based on your regex suggestion, I will be able to adjust the regex for other SPAM phrases as well. Thank you.
The expression cc* means match one or more occurrences of the character c. The tolower function converts the input string ($1) to lowercase, so you get case-insensitive matching.
For your second expression, to match punctuation between “cheese” and “pizza”, add [[:punct:]]* which means zero or more punctuation characters.
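The two points above can be tried out with a small sketch like the following. It is an illustration only: it matches against the whole line ($0) rather than $1 so that a sample containing a space can be included, and the sample strings are taken from the question.

```shell
#!/bin/bash
# cc* = one or more "c", and so on; [[:punct:]]* = zero or more punctuation
# characters between "cheese" and "pizza".
pattern='cc*hh*ee*ss*ee*[[:punct:]]*pp*ii*zz*aa*'

result=$(printf '%s\n' 'ccheesepizza' 'cHeese------piiiiiiiiizzaaaaaa' 'ccheese pizza' |
    awk -v re="$pattern" '{ print $0 ": " (tolower($0) ~ re ? "match" : "no match") }')
printf '%s\n' "$result"
# ccheesepizza: match
# cHeese------piiiiiiiiizzaaaaaa: match
# ccheese pizza: no match
```

The last sample does not match because a space is not in [[:punct:]]; using [[:punct:][:space:]]* instead would cover it.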
Thank you very much; it seems to work when run against echo output. However, I am unable to integrate the “tolower($) ~” part into my bash script, which takes a list of regular expressions from one file and checks whether such phrases exist in another file. When you run the following bash script, it finds no match because the matching is case-sensitive.
Does anyone know what needs to be replaced with what to make the regex match case-insensitive?
echo "cc*hh*ee*ee*ss*ee*pp*ii*zz*zz*aa*" > /dev/shm/repatterns-regexes-list && echo -e "REpatterns file: $(cat /dev/shm/repatterns-regexes-list)"
echo "abc cheesePizzaaaaaa abc" > /dev/shm/file-to-check && echo -e "File to check: $(cat /dev/shm/file-to-check)"
touch /dev/shm/plainphrases-list
# Pass plain phrases file and RE patterns file to awk variables
unwanted=$(LC_ALL=C awk -v plainphrases="/dev/shm/plainphrases-list" -v repatterns="/dev/shm/repatterns-regexes-list" '
BEGIN {
# Plain array
while (getline line < plainphrases) { pp[++ppi]=line }
# RE array
while (getline line < repatterns) { rp[++rpi]=line }
}
# Main loop
{
# Plain matches, eliminate duplicates
for (p in pp) if (index($0, pp[p])) { uniq[pp[p]] }
# RE matches, eliminate duplicates
for (r in rp) if (match($0, rp[r])) { uniq[substr($0, RSTART, RLENGTH)] }
}
END {
# Print the collected results
for (u in uniq) { print u }
}
' "/dev/shm/file-to-check")
if [[ "$unwanted" != "" ]]; then
echo "Found regex matching phrase.";
else echo "Not found regex matching phrase."
fi
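One possible way to make the awk approach above case-insensitive, shown as a sketch rather than a drop-in replacement: lowercase each input line before calling match(), and keep the patterns in the file lowercase. Because tolower() does not change the length of the line under LC_ALL=C, RSTART and RLENGTH can still be used to cut the original spelling out of $0. The temporary directory and file names are made up for this demonstration.

```shell
#!/bin/bash
# Hypothetical scratch files, used only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'cc*hh*ee*ee*ss*ee*pp*ii*zz*zz*aa*' > "$tmpdir/repatterns"
printf '%s\n' 'abc cheesePizzaaaaaa abc' > "$tmpdir/file-to-check"

unwanted=$(LC_ALL=C awk -v repatterns="$tmpdir/repatterns" '
BEGIN {
    # Read one (lowercase) regex per line; "> 0" stops on EOF and on errors
    while ((getline line < repatterns) > 0) rp[++rpi] = line
}
{
    lower = tolower($0)    # match against a lowercased copy of the line
    for (r = 1; r <= rpi; r++)
        if (match(lower, rp[r]))
            uniq[substr($0, RSTART, RLENGTH)]    # keep the original spelling
}
END { for (u in uniq) print u }
' "$tmpdir/file-to-check")

echo "$unwanted"    # cheesePizzaaaaaa
rm -rf "$tmpdir"
```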
pm me to trade
samples to join
sample to join
'"\/;� bad phrase
The aim is to check the content of the “file to examine”, and if part of that content matches, case-insensitively, one or more phrases from the two blocklists mentioned above (which may contain thousands of phrases), then set a variable:
bad="here insert the first bad phrase found in the input file"
then sleep 10 seconds and repeat the whole process.
OK. Next question: Why do you need the actual bad string stored in a variable? (What is the business purpose? What happens to the value of the variable bad next?)
You can detect and print bad lines (without isolating the specific bad string) very simply
with grep -E -i -f and grep -F -i -f.
The above command should extract all the matching strings from file-to-check using the regular expressions in repatterns-regexes-list. Likewise for the plain phrases with grep -F -i -o.
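The exact command is not shown at this point in the thread, so the following is only a sketch of what such grep calls could look like. It assumes a grep that supports -o (GNU and BSD grep do), and the file contents are invented for the demonstration.

```shell
#!/bin/bash
# Hypothetical scratch files, used only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' 'cc*hh*ee*ss*ee*pp*ii*zz*aa*' > "$tmpdir/repatterns-regexes-list"
printf '%s\n' 'pm me to trade' > "$tmpdir/plainphrases-list"
printf '%s\n' 'abc cheesePizzaaaaaa abc' 'Please PM me to TRADE now' > "$tmpdir/file-to-check"

# -E extended regexps, -F fixed strings, -i ignore case,
# -o print only the matching part, -f read the patterns from a file
regex_hits=$(grep -E -i -o -f "$tmpdir/repatterns-regexes-list" "$tmpdir/file-to-check")
plain_hits=$(grep -F -i -o -f "$tmpdir/plainphrases-list" "$tmpdir/file-to-check")

echo "$regex_hits"    # cheesePizzaaaaaa
echo "$plain_hits"    # PM me to TRADE
rm -rf "$tmpdir"
```

Dropping -o prints the whole matching lines instead of just the matched strings.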
If the bash script finds bad phrases in a file, I need to know the first bad phrase it found, because I need to find two specific phrases near the bad phrase.
But as you suggest, maybe I can just use the grep commands you mentioned instead of the complicated awk command above.
Today I spent hours playing with it and ended up with this script, based on the grep commands you proposed:
#!/bin/bash
repatternsfile="/dev/shm/repatterns-regexes-list"
literalstringsfile="/dev/shm/plainphrases-list"
filetocheck="/dev/shm/file-to-check"
while true; do
    # Check the file for phrases listed in $repatternsfile and $literalstringsfile
    # Search using extended regexps and store the first matched string in a variable
    time found=$(grep -a -E -i -o -m 1 -f "$repatternsfile" "$filetocheck" | head -n 1)
    if [[ -z "$found" ]]; then
        # No regex match; try the literal phrases from the other file
        time found=$(grep -a -F -i -o -m 1 -f "$literalstringsfile" "$filetocheck" | head -n 1)
    fi
    # If $found is non-empty, there was a match
    if [[ -n "$found" ]]; then
        echo "Matching phrase found: "; echo "$found"
    fi
    sleep 10
done
When measuring the execution time of my previous single awk (in the 2nd post above) and the current two greps (shown above in this post), the greps are often about 20-30% slower, but with grep I am able to do the case-insensitive matching that I need. So it seems to be working for me, thank you.
If anyone is interested in the regular expressions, which are the topic of this thread, I have come up with the following thanks to @dbauthor. They seem to work well with the script above, and I think they will be unpleasant for the paedo//philes.
cat /dev/shm/repatterns-regexes-list
# WARNING 1: messages containing following phrases will be deleted
# WARNING 2: if you match "bad" then it will delete for example messages containing "Sinbad."
# NOTE: [[:alpha:]] alpha letter/s, [[:digit:]] digit/s, [[:alnum:]] any alphanumeric/s, [[:graph:]] any printable, [[:print:]] any printable and space
# NOTE: Letters that may be doubled, tripled etc. by the spammer, write like pp*ii*zz*zz*aa* (matches for example pppiizzaaaaaaaa)
bounce[[:punct:]]money
bouncedd*oo*tt*money
bounce[[:space:]]dd*oo*tt*[[:space:]]money
# CHESE:
# chese piza
cc*hh*ee*ss*ee*pp*ii*zz*aa*
cc*hh*ee*ss*ee*[[:punct:]]*pp*ii*zz*aa*
cc*hh*ee*ss*ee*[[:space:]]*pp*ii*zz*aa*
# chese pisa
cc*hh*ee*ss*ee*pp*ii*ss*aa*
cc*hh*ee*ss*ee*[[:punct:]]*pp*ii*ss*aa*
cc*hh*ee*ss*ee*[[:space:]]*pp*ii*ss*aa*
# chese pisza
cc*hh*ee*ss*ee*pp*ii*ss*zz*aa*
cc*hh*ee*ss*ee*[[:punct:]]*pp*ii*ss*zz*aa*
cc*hh*ee*ss*ee*[[:space:]]*pp*ii*ss*zz*aa*
# chese pizsa
cc*hh*ee*ss*ee*pp*ii*zz*ss*aa*
cc*hh*ee*ss*ee*[[:punct:]]*pp*ii*zz*ss*aa*
cc*hh*ee*ss*ee*[[:space:]]*pp*ii*zz*ss*aa*
# CHESY:
# chesy piza
cc*hh*ee*ss*yy*pp*ii*zz*aa*
cc*hh*ee*ss*yy*[[:punct:]]*pp*ii*zz*aa*
cc*hh*ee*ss*yy*[[:space:]]*pp*ii*zz*aa*
# chesy pisa
cc*hh*ee*ss*yy*pp*ii*ss*aa*
cc*hh*ee*ss*yy*[[:punct:]]*pp*ii*ss*aa*
cc*hh*ee*ss*yy*[[:space:]]*pp*ii*ss*aa*
# chesy pisza
cc*hh*ee*ss*yy*pp*ii*ss*zz*aa*
cc*hh*ee*ss*yy*[[:punct:]]*pp*ii*ss*zz*aa*
cc*hh*ee*ss*yy*[[:space:]]*pp*ii*ss*zz*aa*
# chesy pizsa
cc*hh*ee*ss*yy*pp*ii*zz*ss*aa*
cc*hh*ee*ss*yy*[[:punct:]]*pp*ii*zz*ss*aa*
cc*hh*ee*ss*yy*[[:space:]]*pp*ii*zz*ss*aa*
# CHEZY:
# chezy piza
cc*hh*ee*zz*yy*pp*ii*zz*aa*
cc*hh*ee*zz*yy*[[:punct:]]*pp*ii*zz*aa*
cc*hh*ee*zz*yy*[[:space:]]*pp*ii*zz*aa*
# chezy pisa
cc*hh*ee*zz*yy*pp*ii*ss*aa*
cc*hh*ee*zz*yy*[[:punct:]]*pp*ii*ss*aa*
cc*hh*ee*zz*yy*[[:space:]]*pp*ii*ss*aa*
# chezy pisza
cc*hh*ee*zz*yy*pp*ii*ss*zz*aa*
cc*hh*ee*zz*yy*[[:punct:]]*pp*ii*ss*zz*aa*
cc*hh*ee*zz*yy*[[:space:]]*pp*ii*ss*zz*aa*
# chezy pizsa
cc*hh*ee*zz*yy*pp*ii*zz*ss*aa*
cc*hh*ee*zz*yy*[[:punct:]]*pp*ii*zz*ss*aa*
cc*hh*ee*zz*yy*[[:space:]]*pp*ii*zz*ss*aa*
# CHESZY
# cheszy piza
cc*hh*ee*ss*zz*yy*pp*ii*zz*aa*
cc*hh*ee*ss*zz*yy*[[:punct:]]*pp*ii*zz*aa*
cc*hh*ee*ss*zz*yy*[[:space:]]*pp*ii*zz*aa*
# cheszy pisa
cc*hh*ee*ss*zz*yy*pp*ii*ss*aa*
cc*hh*ee*ss*zz*yy*[[:punct:]]*pp*ii*ss*aa*
cc*hh*ee*ss*zz*yy*[[:space:]]*pp*ii*ss*aa*
# cheszy pisza
cc*hh*ee*ss*zz*yy*pp*ii*ss*zz*aa*
cc*hh*ee*ss*zz*yy*[[:punct:]]*pp*ii*ss*zz*aa*
cc*hh*ee*ss*zz*yy*[[:space:]]*pp*ii*ss*zz*aa*
# cheszy pizsa
cc*hh*ee*ss*zz*yy*pp*ii*zz*ss*aa*
cc*hh*ee*ss*zz*yy*[[:punct:]]*pp*ii*zz*ss*aa*
cc*hh*ee*ss*zz*yy*[[:space:]]*pp*ii*zz*ss*aa*
# CHEZSY
# chezsy piza
cc*hh*ee*zz*ss*yy*pp*ii*zz*aa*
cc*hh*ee*zz*ss*yy*[[:punct:]]*pp*ii*zz*aa*
cc*hh*ee*zz*ss*yy*[[:space:]]*pp*ii*zz*aa*
# chezsy pisa
cc*hh*ee*zz*ss*yy*pp*ii*ss*aa*
cc*hh*ee*zz*ss*yy*[[:punct:]]*pp*ii*ss*aa*
cc*hh*ee*zz*ss*yy*[[:space:]]*pp*ii*ss*aa*
# chezsy pisza
cc*hh*ee*zz*ss*yy*pp*ii*ss*zz*aa*
cc*hh*ee*zz*ss*yy*[[:punct:]]*pp*ii*ss*zz*aa*
cc*hh*ee*zz*ss*yy*[[:space:]]*pp*ii*ss*zz*aa*
# chezsy pizsa
cc*hh*ee*zz*ss*yy*pp*ii*zz*ss*aa*
cc*hh*ee*zz*ss*yy*[[:punct:]]*pp*ii*zz*ss*aa*
cc*hh*ee*zz*ss*yy*[[:space:]]*pp*ii*zz*ss*aa*
( adding [:alpha:] between [:punct:] and [:space:] would match too many words in between )
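One thing to keep in mind with a commented list like the one above: grep -f treats every line of the pattern file as a pattern, including the lines that start with #, and an empty line is a pattern that matches every line. A small sketch of filtering the list before use follows; the file names and sample text are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical scratch files, used only for this demonstration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/patterns-with-comments" <<'EOF'
# chese piza
cc*hh*ee*ss*ee*pp*ii*zz*aa*
cc*hh*ee*ss*ee*[[:space:]]*pp*ii*zz*aa*
EOF
printf '%s\n' 'harmless line' 'buy Cheese Pizzaaa here' > "$tmpdir/file-to-check"

# Keep only the real patterns: drop comment lines and blank lines.
grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' \
    "$tmpdir/patterns-with-comments" > "$tmpdir/patterns-clean"

clean=$(grep -E -i -o -m 1 -f "$tmpdir/patterns-clean" "$tmpdir/file-to-check" | head -n 1)
echo "$clean"    # Cheese Pizzaaa
rm -rf "$tmpdir"
```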
If anyone is curious how that works, here is ChatGPT's explanation:
c+: Matches the letter “c” one or more times.
h+: Matches the letter “h” one or more times.
e+: Matches the letter “e” one or more times.
[sz]+: Matches either the letter “s” or “z” one or more times.
[ey]+: Matches either the letter “e” or “y” one or more times.
[[:punct:][:space:]]*: Matches any punctuation or space character zero or more times.
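The components described above could be combined into a single extended regular expression roughly as follows. The full regex is not shown in the thread, so this is a reconstruction: the part after the separator (p+i+[sz]+a+) is an assumption made by analogy with the list of patterns, and the sample strings are only for illustration.

```shell
#!/bin/bash
# Assumed reconstruction: the "pizza" half mirrors the "cheese" half.
regex='c+h+e+[sz]+[ey]+[[:punct:][:space:]]*p+i+[sz]+a+'

samples='ccheesepizza
cHeese------piiiiiiiiizzaaaaaa
ccheese pizza
chezsy pizsa
cheese and pizza'

# -E is needed for +, -i ignores case, -o prints only the matched text
hits=$(printf '%s\n' "$samples" | grep -E -i -o "$regex")
printf '%s\n' "$hits"
```

The first four samples are printed and the last one is not, because "and" is neither punctuation nor whitespace. To limit the separator to about 15 characters, as mentioned in the original question, the * could be replaced with the interval {0,15}.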