This is based on a previously asked question that was deleted by its OP, but it got me thinking and playing around with it, and I managed to solve it without using a regex (only (I)LIKE() with % or _ allowed).

  • The question is also a lesson in the perils of allowing users to enter free text!

However, my solution is quite cumbersome, and I've racked my brains trying to find a more elegant one, but no joy.

I'm not going to put my own solution forward yet.

A typical string might be like this:

'First order 437.3/10-87 16NY100013XX55 - Return'

So, the only thing you can be sure of is that the order text within the string (order_name) starts with '437.' and that it ends with the last digit in the string.

  • N.B. The text following the order is arbitrary: its only constraint is that it can't contain a digit!

So, the result for this particular example would be

'437.3/10-87 16NY100013XX55'

So, my table and data are below and also on the fiddle here:

CREATE TABLE t1 
(
  id INT NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  order_name TEXT NOT NULL
);

strings:

INSERT INTO t1 (order_name) VALUES
('First order 437.8/03-87 22LA190028YV7ER55'),
('Second order 437.8-03-87 22LA19'),
('First order 437.3/10-87 16NY100013XX55 - Return'),
('Order 21.02.2022 437.8/10-87 16WA239766'),
('437.8/10-87 16NY10023456YY78 - Paid'),
('First order (437.8/03-87 22LA190028)'),
('Visit 02.02.2023 Order 437.5/10-87 16DC107765X56 REFUND'),
('Visit 02.02.2023 Order 437.5/10-87 16DC1077657FFR56REFUND'),    
('Visit 02.02.2023 Order 437.5/10-87 16DC107765745 - Reorder');

Desired result:

Order Text
437.8/03-87 22LA190028YV7ER55
437.8-03-87 22LA19
437.3/10-87 16NY100013XX55
437.8/10-87 16WA239766
437.8/10-87 16NY10023456YY78
437.8/03-87 22LA190028
437.5/10-87 16DC107765X56
437.5/10-87 16DC1077657FFR56
437.5/10-87 16DC107765745

The fiddle is PostgreSQL 16, but I've checked and it works back to version 10. The question is PostgreSQL based, but I'm open to elegant solutions from other systems.
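
For comparison only, the regex baseline that this question deliberately rules out could look like the sketch below. It uses PostgreSQL's POSIX-regex form of substring() on two literals from the sample data; the aliases v, s and order_text are illustrative.

```sql
-- Regex baseline (NOT an allowed answer): match the leftmost '437.',
-- then let the greedy .* backtrack to the last digit in the string.
SELECT s,
       substring(s FROM '437\..*[0-9]') AS order_text
FROM (VALUES
        ('First order 437.3/10-87 16NY100013XX55 - Return'),
        ('Visit 02.02.2023 Order 437.5/10-87 16DC1077657FFR56REFUND')
     ) AS v(s);
-- order_text: '437.3/10-87 16NY100013XX55'
--             '437.5/10-87 16DC1077657FFR56'
```

Any regex-free answer should return the same two values for these rows.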

  • Is there a specific reason to avoid regular expressions? Commented Jun 11, 2024 at 10:13
  • I just want to do it that way! With a regex, it's trivial - but, as I said, I've found a way of doing it without one and I'd just like to see if there is a better way than the one I've found - that's why I'm going to apply a bonus (no matter what - if you answer now, I'll still wait until the bonus comes up before accepting an answer). Just on a whim one might say! :-) Commented Jun 11, 2024 at 10:39
  • Re I just want to do it that way: Almost always, less code is good. Unless you need to not use regex (many DBs don't support it), go for the "less code" (but readable) solution. Commented Jun 12, 2024 at 8:48
  • @Zegarek Easier to whitelist, only (I)LIKE() % or _ allowed. The regex engine is overhead that isn't present in the simple POSITION(), SUBSTRING(), REPLACE() (or others, TRANSLATE() &c...) - the general advice (and my own ad-hoc testing bears this out) is that regexes should only be used when all other avenues have been exhausted. I just became intrigued by this question for some reason (sad life, nothing else to do... don't ask! :-) ) - I'll set up a benchmark and include a regex solution and we can see - based on tests and facts. Gross breach of SO tradition, but hey! Have a go? Commented Jun 14, 2024 at 11:03
  • @Vérace It was tricky to dig this up: I tried to recall a thread with at least one self-answer and at least one answer with a bounty, with postgres and regex tags. I only caught it when I narrowed it down to recent questions with a regex in just the body and a bounty that was added but not necessarily awarded. I'm still somewhat curious about the code you mentioned you were planning to share. Commented Aug 28, 2024 at 16:35

7 Answers

3

Find the positions of '437.' and the last digit in the order_name string (with a scalar subquery) to use with substring function.

select substring(
  order_name, 
  position('437.' in order_name), 
  (
   select max(o)::integer 
   from string_to_table(order_name, null) with ordinality as t(c, o)
   where c between '0' and '9'
  ) - position('437.' in order_name) + 1
) from t1;

Fiddle
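
To see what the scalar subquery computes, here is a minimal sketch that runs the same split on a single literal instead of order_name. The sample string is made up for illustration; note that string_to_table requires PostgreSQL 14 or later.

```sql
-- One row per character, numbered from 1 by WITH ORDINALITY;
-- keep only the rows whose character is a digit.
SELECT c, o
FROM string_to_table('X56 REFUND', NULL) WITH ORDINALITY AS t(c, o)
WHERE c BETWEEN '0' AND '9'
ORDER BY o;
-- Rows: ('5', 2) and ('6', 3), so max(o) = 3 is the position of the last digit.
```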

2 Comments

It seems to be a clever way of removing the letters, and keeping the number, by first replacing all numbers by a 9 (and looking from the end of the string because you use reverse()). see: dbfiddle.uk/c0KJxZ0-
short, easy to read, immediately apparent how the query works, easily adaptable for that reason. Very nice.
2

To extract the string without using regular expressions, I would use string functions like POSITION and SUBSTRING. I have two solutions:

Solution 1:

WITH ExtractedOrders AS (
  SELECT
    SUBSTRING(order_name FROM POSITION('437.' IN order_name)) AS raw_order_text
  FROM t1
)
SELECT
  TRIM(
    BOTH ' - ' FROM
    CASE
      WHEN raw_order_text LIKE '%Paid' THEN LEFT(raw_order_text, LENGTH(raw_order_text) - LENGTH('Paid'))
      WHEN raw_order_text LIKE '%REFUND' THEN LEFT(raw_order_text, LENGTH(raw_order_text) - LENGTH('REFUND'))
      WHEN raw_order_text LIKE '%Reorder' THEN LEFT(raw_order_text, LENGTH(raw_order_text) - LENGTH('Reorder'))
      WHEN raw_order_text LIKE '%Return' THEN LEFT(raw_order_text, LENGTH(raw_order_text) - LENGTH('Return'))
      WHEN raw_order_text LIKE '%)' THEN LEFT(raw_order_text, LENGTH(raw_order_text) - LENGTH(')'))
      ELSE raw_order_text
    END
  ) AS "Order Text"
FROM ExtractedOrders;

Solution 2:

WITH ExtractedOrders AS (
  SELECT
    id,
    SUBSTRING(order_name FROM POSITION('437.' IN order_name)) AS raw_order_text
  FROM t1
)
SELECT
  TRIM(BOTH ' -' FROM cleaned_order_text) AS "Order Text"
FROM (
  SELECT
    id,
    TRIM(BOTH ' -' FROM
      REPLACE(
        REPLACE(
          REPLACE(
            REPLACE(
              REPLACE(raw_order_text, 'Paid', ''), 
              'REFUND', ''
            ), 
            'Reorder', ''
          ), 
          'Return', ''
        ), 
        ')', ''
      )
    ) AS cleaned_order_text
  FROM ExtractedOrders
) AS cleaned_orders;

Edit

I had understood your question a little differently: my assumption was that the ending text is consistent and can be handled with hard-coded methods.

As per your comment, I changed my approach: find the minimum index of a digit in the reversed string, then fetch the substring using lengths and indices. Here is the new solution:

WITH RECURSIVE check_integers AS (
  SELECT
    order_name,
    reverse(order_name) AS reversedStr,
    1 AS idx,
    substring(reverse(order_name) from 1 for 1) AS current_char,
    -- digit test without a regex, in keeping with the question's constraint
    substring(reverse(order_name) from 1 for 1) BETWEEN '0' AND '9' AS is_int
  FROM t1
  UNION ALL
  SELECT
    order_name,
    reversedStr,
    idx + 1 AS idx,
    substring(reversedStr from idx + 1 for 1) AS current_char,
    substring(reversedStr from idx + 1 for 1) BETWEEN '0' AND '9' AS is_int
  FROM
    check_integers
  WHERE
    idx < length(reversedStr)
)
SELECT
  substring(a.order_name from POSITION('437.' IN a.order_name) for (length(a.order_name) - POSITION('437.' IN a.order_name) - idx + 2))  as orderText
FROM (  
  SELECT
    min(idx) as idx,
    order_name
  FROM
    check_integers
  WHERE
    is_int = true
  GROUP BY
    order_name
) a;

2 Comments

I'm afraid that the text following the order_name can be arbitrary - I'll edit the question so that is more clear - I mentioned lesson in the perils of allowing users to enter free text! - the only constraint about the text following is that it can't contain digits!
Oh, sorry for the misunderstanding. I am posting my new solution as an edit to my answer; I hope that is what you were looking for.
0

EDIT: Originally I was looking for the first digit after the word "Order " because I missed the part where the order starts with "437". I've updated the answer but left in the slightly more complex logic for finding the start position of the search.

This is a TSQL solution

  • I want to find the first and last digit positions within the string.
  • I'll search for the first position either after the occurrence of the word "437" or the start of the string in its absence.
  • The last position is calculated by reversing the string
  • I'm using CROSS APPLY with VALUES so I can refer to the previous calculations without repeating the code (for readability)

The only tricky thing should be off-by-one errors

SELECT 
    *
FROM dbo.t1 AS t
CROSS APPLY (values ( COALESCE(NULLIF(PATINDEX('%437%', t.order_name), 0), 1))) AS ca1(start_pos) /* start search from the first "437", or from the beginning if it is absent */
CROSS APPLY (values (
    ca1.start_pos + PATINDEX('%[0-9]%', SUBSTRING(t.order_name, ca1.start_pos, LEN(t.order_name))) -1 /* find first digit from start_pos */
    , LEN(t.order_name) - PATINDEX('%[0-9]%', REVERSE(t.order_name)) +1 /* find first digit in reversed string */
    )) AS ca2(first_digit_pos, last_digit_pos)
CROSS APPLY (values ( SUBSTRING(t.order_name, ca2.first_digit_pos, ca2.last_digit_pos-ca2.first_digit_pos +1))) AS ca3(isolated_order)
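
Since the question is PostgreSQL-based, the same staged style can be sketched there with CROSS JOIN LATERAL in place of CROSS APPLY. This is an assumed translation, not part of the T-SQL above: PostgreSQL has no PATINDEX, so the sketch locates the last digit with translate() and reverse() instead, starts at '437.' directly, and runs on a single literal rather than t1.

```sql
SELECT t.order_name, ca3.isolated_order
FROM (VALUES ('Visit 02.02.2023 Order 437.5/10-87 16DC107765X56 REFUND')) AS t(order_name)
-- stage 1: where the order text starts
CROSS JOIN LATERAL (
    SELECT position('437.' IN t.order_name)
) AS ca1(start_pos)
-- stage 2: map every digit to '9', then find the first '9' in the reversed string
CROSS JOIN LATERAL (
    SELECT length(t.order_name)
           - position('9' IN translate(reverse(t.order_name), '0123456789', '9999999999'))
           + 1
) AS ca2(last_digit_pos)
-- stage 3: cut out the order text using the two positions
CROSS JOIN LATERAL (
    SELECT substring(t.order_name, ca1.start_pos, ca2.last_digit_pos - ca1.start_pos + 1)
) AS ca3(isolated_order);
-- isolated_order = '437.5/10-87 16DC107765X56'  (start_pos = 24, last_digit_pos = 48)
```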

0

Just selecting the substring starting at the position of 437., and ending on the second space after the start of that:

SELECT
  order_name,
  substring(attempt1,1,case when p2=0 then 500 else p1+p2 end) as substring
FROM (
  SELECT 
    order_name,
    attempt1,
    position(' ' in attempt1) as p1,
    position(' ' in substring(attempt1, 1+position(' ' in attempt1))) as p2
  FROM (
    SELECT
      order_name, 
      substring(order_name,position('437.' IN order_name)) as attempt1
    FROM t1
  ) AS s1  -- subquery aliases are mandatory before PostgreSQL 16
) AS s2;

output:

order_name                                                  substring
----------------------------------------------------------  ----------------------------------
First order 437.8/03-87 22LA190028YV7ER55                   437.8/03-87 22LA190028YV7ER55
Second order 437.8-03-87 22LA19                             437.8-03-87 22LA19
First order 437.3/10-87 16NY100013XX55 - Return             437.3/10-87 16NY100013XX55
Order 21.02.2022 437.8/10-87 16WA239766                     437.8/10-87 16WA239766
437.8/10-87 16NY10023456YY78 - Paid                         437.8/10-87 16NY10023456YY78
First order (437.8/03-87 22LA190028)                        437.8/03-87 22LA190028)
Visit 02.02.2023 Order 437.5/10-87 16DC107765X56 REFUND     437.5/10-87 16DC107765X56
Visit 02.02.2023 Order 437.5/10-87 16DC1077657FFR56REFUND   437.5/10-87 16DC1077657FFR56REFUND
Visit 02.02.2023 Order 437.5/10-87 16DC107765745 - Reorder  437.5/10-87 16DC107765745

see: DBFIDDLE

EDIT: Removing the REFUND, or the ), is easy using regular expressions; just do:

regexp_replace(substring(attempt1,1,case when p2=0 then 500 else p1+p2 end),'[^0-9]*$','','g') as substring2

When doing this without regular expressions, a user-defined function might be needed, especially when you do not want to hard-code stuff like REFUND or REORDER.

see: DBFIDDLE

EDIT2: To remove letters and other non-numeric characters from the right side of the string, I created this function:

CREATE OR REPLACE FUNCTION public.trimlettersfromrightofstring(s varchar(200))
    RETURNS varchar(200)
    LANGUAGE sql
AS $function$
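with recursive abc as (
      SELECT s::text as a   -- cast so both branches of the recursion are text
    ), abc2(a,b,c) as (
      select 
         a,
         trim(a) as b,      -- strip nothing yet, so a trailing digit is kept
         1 as c 
      from abc
      
      union all
      
      -- drop the trailing run of the last character while it is not a digit
      select 
         b,
         rtrim(trim(b),right(trim(b),1)), 
         c+1
      from abc2 
      where trim(b) <> ''   -- stop if the string contains no digit at all
        and (right(trim(b),1) <'0' or right(trim(b),1) > '9')
    )
    SELECT b from abc2 order by c desc limit 1;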
$function$

This can be used as SELECT trimlettersfromrightofstring('1234abcdef5hijk');, which returns 1234abcdef5

This is the link to the adapted DBFIDDLE, which produced the desired results.

4 Comments

Doesn't work for the second last record - no space between the last digit and refund!
Ok, I will have a look at that ..... 😉
@Vérace: Added a FUNCTION to remove the non-numeric stuff at the end of the substring. (without using regular expressions)
P.S. This reminds me to do more with FUNCTIONS and/or PROCEDURES, because my experience in that area is not good enough to simply write something that works the first time..... 🤔😕
0

My version, with a fancy way of finding the position of the last digit (position returns 0 when the substring is not present) ;)

WITH indexes AS (
    WITH reversed AS (
        SELECT
            order_name,
            reverse(order_name) AS order_name_reversed
        FROM
            t1
    )
    SELECT
        order_name,
        position('437.' IN order_name) AS start,
        length(order_name) -
        least(
            CASE WHEN position('0' IN order_name_reversed) = 0 THEN NULL ELSE position('0' IN order_name_reversed) END,
            CASE WHEN position('1' IN order_name_reversed) = 0 THEN NULL ELSE position('1' IN order_name_reversed) END,
            CASE WHEN position('2' IN order_name_reversed) = 0 THEN NULL ELSE position('2' IN order_name_reversed) END,
            CASE WHEN position('3' IN order_name_reversed) = 0 THEN NULL ELSE position('3' IN order_name_reversed) END,
            CASE WHEN position('4' IN order_name_reversed) = 0 THEN NULL ELSE position('4' IN order_name_reversed) END,
            CASE WHEN position('5' IN order_name_reversed) = 0 THEN NULL ELSE position('5' IN order_name_reversed) END,
            CASE WHEN position('6' IN order_name_reversed) = 0 THEN NULL ELSE position('6' IN order_name_reversed) END,
            CASE WHEN position('7' IN order_name_reversed) = 0 THEN NULL ELSE position('7' IN order_name_reversed) END,
            CASE WHEN position('8' IN order_name_reversed) = 0 THEN NULL ELSE position('8' IN order_name_reversed) END,
            CASE WHEN position('9' IN order_name_reversed) = 0 THEN NULL ELSE position('9' IN order_name_reversed) END
        ) + 1 AS last_digit
    FROM
        reversed
)
SELECT
    *,
    substring(order_name, start, last_digit - start + 1)
FROM
    indexes;
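
The CASE expressions turn position()'s 0 into NULL so that least(), which ignores NULL arguments in PostgreSQL, only sees digits that are actually present. NULLIF expresses the same idea more compactly. A minimal sketch on one literal (the aliases v, s, rev and r are illustrative):

```sql
SELECT s,
       length(s) - least(
           NULLIF(position('0' IN r), 0),
           NULLIF(position('1' IN r), 0),
           NULLIF(position('2' IN r), 0),
           NULLIF(position('3' IN r), 0),
           NULLIF(position('4' IN r), 0),
           NULLIF(position('5' IN r), 0),
           NULLIF(position('6' IN r), 0),
           NULLIF(position('7' IN r), 0),
           NULLIF(position('8' IN r), 0),
           NULLIF(position('9' IN r), 0)
       ) + 1 AS last_digit
FROM (VALUES ('16DC107765X56 REFUND')) AS v(s)
CROSS JOIN LATERAL (SELECT reverse(v.s) AS r) AS rev;
-- last_digit = 13: the string has 20 characters and the first digit
-- in the reversed string ('6') sits at position 8, so 20 - 8 + 1 = 13.
```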

0

Since version 16, I would prefer to use a RECURSIVE CTE to get the starting position and the last-digit positions for all rows and then process them in the main SQL. This is mainly for the readability and maintainability of the code, which is the main reason to use CTEs anyway.
Downsides: it will not work on versions before 16. Also, with very large tables this could be somewhat costly in performance, in which case that cost should be weighed against the readability/maintainability benefits before deciding whether to use it.

WITH     
  RECURSIVE 
  cte AS (  SELECT id, order_name, 0 AS nmbr,
                   POSITION('437.' In order_name) as start_pos, 
                   Length(order_name) - POSITION(Cast(0 as VarChar) In reverse(order_name)) + 1 as last_nmbr_pos
            FROM   t1
          UNION  ALL
            SELECT c.id, c.order_name, c.nmbr + 1,
                   POSITION('437.' In order_name), 
                   Length(c.order_name) - POSITION(Cast(c.nmbr + 1 as VarChar) In reverse(c.order_name)) + 1
            FROM   cte      c
            WHERE  nmbr < 9 
        ) 

The code above generates 10 rows for each order_name (with numbers 0, 1, 2, ..., 8, 9) and finds the starting position of the order text as the position of '437.', as well as the last position of each number (0-9) in every order_name. Reverse is used to get the last of possibly repeating numbers (like 55). The actual position is calculated as Length(order_name) - position of the particular number in the reversed order_name + 1.
This recursive CTE is used as a subquery in the code below, where the max_last_nmbr_pos column is added (using the Max() Over() analytic function) so that we can fetch just the rows where the number in question is the last one in order_name (see the outer Where clause). Now, having just the right rows with the start and end positions of the order text, the SubStr() function extracts the actual order text from the order_name column.

--    M a i n    S Q L :
SELECT    id,  
          SubStr(order_name, start_pos, last_nmbr_pos - start_pos + 1) as order_text
FROM     ( Select    id, order_name, nmbr, start_pos, last_nmbr_pos,
                     Max(last_nmbr_pos) Over(Partition By id)  as max_last_nmbr_pos
           From cte
           Where POSITION(Cast(nmbr as VarChar) In order_name) > 0 
         )
WHERE    last_nmbr_pos = max_last_nmbr_pos

The Where clause of the subquery above excludes the rows of recursive cte where particular number (0-9) does not exist in order_name.

/*     R e s u l t : 
id  order_text
--  ----------------------------------
 1  437.8/03-87 22LA190028YV7ER55
 2  437.8-03-87 22LA19
 3  437.3/10-87 16NY100013XX55
 4  437.8/10-87 16WA239766
 5  437.8/10-87 16NY10023456YY78
 6  437.8/03-87 22LA190028
 7  437.5/10-87 16DC107765X56
 8  437.5/10-87 16DC1077657FFR56
 9  437.5/10-87 16DC107765745           */

See the fiddle here.
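
The ten rows per order_name that the recursive CTE produces could also be generated with generate_series, which may read more easily. This is a sketch of that alternative, not the query above: it runs on one literal rather than t1, and the aliases v, s, g and d are illustrative.

```sql
-- One row per digit 0-9; NULLIF drops digits that do not occur,
-- and min() picks the digit that comes first in the reversed string.
SELECT v.s,
       length(v.s)
       - min(NULLIF(position(g.d::text IN reverse(v.s)), 0))
       + 1 AS last_nmbr_pos
FROM (VALUES ('437.8/03-87 22LA190028)')) AS v(s)
CROSS JOIN generate_series(0, 9) AS g(d)
GROUP BY v.s;
-- last_nmbr_pos = 22 (23 characters; the '8' is second in the reversed string)
```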

0

The main issue is finding the last digit in order_name. We can translate all digits to a single one, for example '9', and then search for its first occurrence in the reversed string.

See example

select id
  ,substring(ordn,1
      ,length(ordn)
        -position('9' in translate(reverse(ordn),'0123456789','9999999999'))+1
    ) as short_num
from (
  select *
    ,substring(order_name FROM position('437.' in order_name)) ordn
  from t1
)t2

Use char_length, if needed.
Fiddle
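
To make the arithmetic concrete, the sketch below shows the intermediate values for one literal taken from the sample data (the column aliases are illustrative). It assumes the string contains at least one digit; otherwise position() returns 0 and the computed position would be one past the end of the string.

```sql
SELECT s,
       reverse(s)                                        AS reversed,
       translate(reverse(s), '0123456789', '9999999999') AS digits_as_9,
       position('9' IN translate(reverse(s), '0123456789', '9999999999')) AS first_9,
       length(s)
       - position('9' IN translate(reverse(s), '0123456789', '9999999999'))
       + 1                                               AS last_digit_pos
FROM (VALUES ('22LA190028)')) AS v(s);
-- reversed       = ')820091AL22'
-- digits_as_9    = ')999999AL99'
-- first_9        = 2
-- last_digit_pos = 10   (11 - 2 + 1; character 10 is the final '8')
```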

Short query

select *
  ,substring(order_name,position('437.' in order_name)
    ,length(order_name)-position('437.' in order_name)+1
     -position('9'in translate(reverse(order_name),'0123456789','9999999999'))+1
  ) as short_num
from t1

id  order_name                                                  short_num
--  ----------------------------------------------------------  -----------------------------
 1  First order 437.8/03-87 22LA190028YV7ER55                   437.8/03-87 22LA190028YV7ER55
 2  Second order 437.8-03-87 22LA19                             437.8-03-87 22LA19
 3  First order 437.3/10-87 16NY100013XX55 - Return             437.3/10-87 16NY100013XX55
 4  Order 21.02.2022 437.8/10-87 16WA239766                     437.8/10-87 16WA239766
 5  437.8/10-87 16NY10023456YY78 - Paid                         437.8/10-87 16NY10023456YY78
 6  First order (437.8/03-87 22LA190028)                        437.8/03-87 22LA190028
 7  Visit 02.02.2023 Order 437.5/10-87 16DC107765X56 REFUND     437.5/10-87 16DC107765X56
 8  Visit 02.02.2023 Order 437.5/10-87 16DC1077657FFR56REFUND   437.5/10-87 16DC1077657FFR56
 9  Visit 02.02.2023 Order 437.5/10-87 16DC107765745 - Reorder  437.5/10-87 16DC107765745
