Expand parse without semicolons #1949

aharpervc · 2025-07-16T18:57:35Z

This PR is a followup (ref) to recent work on parsing without requiring semicolon statement delimiters. It's the same content as my former PR which was stale after #1937.

This PR is rebased on latest master and brings forward the increased test coverage and additional parsing fixes from the previous branch.

Particularly, this PR expands support as follows:

Added a supports_statements_without_semicolon_delimiter dialect trait function (default false, but true for SQL Server). Now we can set the default parser option based on the dialect
Tightened up parsing logic, such as for when something is an alias or a keyword and comma delimited lists. When you write T-SQL without semicolons, you're going to have ambiguities, but being more selective here in the parser seems reasonable
More test coverage & expanded test helper functions: statements_without_semicolons_parse_to which deletes semicolons from the string, then parses it (simplifies testing parsing with & without), all_dialects_requiring_semicolon_statement_delimiter & all_dialects_not_requiring_semicolon_statement_delimiter to find configured dialects, and assert_err_parse_statements to simplify asserting "end of statement" vs "an SQL statement" errors. A lot of test assertions that fail when requiring semicolons also fail when not requiring semicolons, but what precisely the parser will complain about could be different (again, if you write SQL this way, you accept that possibility). The helper functions enable running existing tests on dialects that don't require semicolons with minimal changes. The main usage there is in the "common" tests.

aharpervc · 2025-07-16T19:17:51Z

@alamb fyi, as previously discussed

- a corresponding `supports_statements_without_semicolon_delimiter` Dialect trait function - this is optional for SQL Server, so it's set to `true` for that dialect - for the implementation, `RETURN` parsing needs to be tightened up to avoid ambiguity & tests that formerly asserted "end of statement" now maybe need to assert "an SQL statement" - a new `assert_err_parse_statements` splits the dialects based on semicolon requirements & asserts the expected error message accordingly

- at least all of select/insert/update/delete (plus exec) can be added

…icolons

alamb

Thanks @aharpervc -- sorry for the delay in reviewing. I left some comments. Let me know what you think

alamb · 2025-07-21T18:20:54Z

src/dialect/mssql.rs

@@ -123,6 +123,10 @@ impl Dialect for MsSqlDialect {
        true
    }

+    fn supports_statements_without_semicolon_delimiter(&self) -> bool {


🤦 MS SQL!!!!

🤷‍♂️🤣

alamb · 2025-07-21T18:22:08Z

src/dialect/mod.rs

@@ -1136,6 +1142,11 @@ pub trait Dialect: Debug + Any {
    fn supports_notnull_operator(&self) -> bool {
        false
    }
+
+    /// Returns true if the dialect supports parsing statements without a semicolon delimiter.


Would it be possible to give an example here?

Something like

Suggested change

/// Returns true if the dialect supports parsing statements without a semicolon delimiter.

/// Returns true if the dialect supports parsing statements without a semicolon delimiter.

///

/// If returns true, the following SQL will not parse. If returns `false` the SQL will parse

///

/// ```sql

/// SELECT 1

/// SELECT 2

/// ```

alamb · 2025-07-21T18:23:22Z

src/parser/mod.rs

@@ -266,6 +266,22 @@ impl ParserOptions {
        self.unescape = unescape;
        self
    }
+
+    /// Set if semicolon statement delimiters are required.


What is the usecase to have the same configuration information on the ParserOptions and the Dialect?

It seems there are a few places in the Parser that inconsistently check the options and/or the dialect flag. If we had only one way to specify the option, there would not be any room for inconsistencies

Or maybe I misunderstand what is going on here

What is the usecase to have the same configuration information on the ParserOptions and the Dialect?

It's similar to the existing with_trailing_commas logic, and has several practical use cases:

can be set to true for SQL Server & left false for all others (until if/when someone discovers this is supported in other dialects)

integration with existing testing patterns. eg, we can have all_dialects_requiring_semicolon_statement_delimiter & all_dialects_not_requiring_semicolon_statement_delimiter

used to set the corresponding default parser option

It seems like testing parsing without semicolons would get a lot worse if it wasn't a dialect option

Ah yeah the with_trailing_commas option was introduced as parser setting originally and later had to be added as a dialect option since it was dialect specific, I figure ideally it would have been present only on the dialect.

My current thinking is be that ideally we have one way to configure the semicolon option since it would be nice to not have two knobs for the flag. In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting - i.e. whether or not semicolon is required is more reflecting the tool that originally processed the input file, vs an aspect of the sql language/dialect itself, so that it can differ in certain contexts for the same dialect.

In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting

This seems reasonable to me in terms of philosophy. However, I don't know how that could be implemented practically. I.e., how would you write that code in this project to be able to have the same test coverage?

The benefit of setting the parse option based on the dialect configuration is:

not all dialects support parsing without semicolons (so far, only one)

for dialects that require semicolons, we probably don't want anything to change (as is implemented here in the PR)

this project's tests are oriented around dialect features, so implementing the semicolon requirement as a dialect feature makes it more clear for how to run the tests

alamb · 2025-07-21T18:23:57Z

src/parser/mod.rs

@@ -4541,6 +4561,18 @@ impl<'a> Parser<'a> {
            return Ok(vec![]);
        }

+        if end_token == Token::SemiColon
+            && self


do we also have to check the parser options here too?

It might be that we should only be checking the parser options. I'll investigate

Switched to parser options 👍

alamb · 2025-07-21T18:25:52Z

src/parser/mod.rs

+                | Expr::Cast { .. }
+                | Expr::Convert { .. }
+                | Expr::Subquery(_) => Ok(expr),
+                // todo: how to retstrict to variables?


This seems more like a semantic check -- this crate is focused on syntax parsing.

Read more hre:

https://github.com/apache/datafusion-sqlparser-rs?tab=readme-ov-file#syntax-vs-semantics

I think the error is inappropriate in this case

I think the error is inappropriate in this case

What do you suggest?

This seems more like a semantic check -- this crate is focused on syntax parsing.

The difficulty here is that without semicolon tokens the return statement is ambiguous. Consider the following perfectly valid T-SQL statement (which is also a test case example) which has several variations of return:

CREATE OR ALTER PROCEDURE example_sp AS IF USER_NAME() = 'X' RETURN IF 1 = 2 RETURN (SELECT 1) RETURN CONVERT(INT, 123)

If you revert this change and run that test, you get this error:

thread 'test_supports_statements_without_semicolon_delimiter' panicked at tests/sqlparser_mssql.rs:2533:14: called `Result::unwrap()` on an `Err` value: ParserError("Expected: an SQL statement, found: 1 at Line: 1, Column: 72")

The reason is that for the first return, the self.maybe_parse(|p| p.parse_expr())? "successfully" parses/consumes the IF keyword. However, this is contrary to the keyword documentation for SQL Server (ref) which requires an "integer expression".

I think I had said when introducing parse_return in a previous PR that we'd have to come back and tighten it up eventually, but I can't find that discussion 😐.

@aharpervc in the example, is the ambiguity stemming from tsql not supporting IF expressions but only IF statements? If so we could flag that difference in the parser dialect, avoiding this diff. If it does support IF statements then the grammar indeed looks either ambiguous or relying on the newline character to differentiate between the two scenarios?

I don't see this being about the if statement, but rather about the return statement. The question for the parse is, given a return keyword, how many more tokens (or similarly, which expr's) should be considered part of that ReturnStatement? For T-SQL, there is a restricted answer to that question, but it is somewhat awkward to express here.

if it does support IF statements then the grammar indeed looks either ambiguous or relying on the newline character to differentiate between the two scenarios?

Newlines don't really help you here, I put them in the example above for readability, but this should parse identically (and does if you run on a real SQL Server instance):

CREATE OR ALTER PROCEDURE example_sp AS IF USER_NAME() = 'X' RETURN IF 1 = 2 RETURN (SELECT 1) RETURN CONVERT(INT, 123)

Ah right, makes sense that they're not newline dependent. To clarify what I meant in the IF scenario the idea would be to look whether the next token can not be an expression - from a look at their docs the options seem limited to
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/control-of-flow?view=sql-server-ver15

if self.peek_one_of_keywords( // hmm noticed while enumerating this list that the only ambiguous case really // seems to be IF - since offhand the others can't be an expression. // So that the list might not be necessary after all. [BEGIN, BREAK, CONTINUE, GOTO, IF, RETURN, THROW, TRY, WAITFOR, WHILE] ) { // bare return statement } // else maybe_parse_expr

would something like this work?

alamb · 2025-07-21T18:27:10Z

src/test_utils.rs

+            .replace(" ;", " ")
+            .replace(";\n", "\n")
+            .replace("\n;", "\n")
+            .replace(";", " ");


wouldn't this last statement (replace ";" with " " handle all the above? Or is the idea that newlines are important ?

Can you please add comments explaining the rationale for these substitutions?

I'm preserving the newline/whitespace around the statement delimiter because I'm trying to avoid being opinionated on whether they're meaningful or not in this helper. i.e., the helper should only strip semicolons and should avoid other transformations. I can add a comment

Good point. At some point I was probably testing this in conjunction with #1809 and might have gotten it from there. Regardless, I'll simplify it here per your suggestion.

alamb · 2025-07-21T18:28:23Z

tests/sqlparser_common.rs

@@ -272,20 +275,39 @@ fn parse_insert_default_values() {
        "INSERT INTO test_table DEFAULT VALUES (some_column)";
    assert_eq!(
        ParserError::ParserError("Expected: end of statement, found: (".to_string()),
-        parse_sql_statements(insert_with_default_values_and_hive_after_columns).unwrap_err()
+        all_dialects_requiring_semicolon_statement_delimiter()


this query doesn't seem to have multiple statements. Why does the test need to be changed?

Turning on the option to make semicolons optional results in this test failures:

thread 'parse_insert_default_values' panicked at src/test_utils.rs:101:21: assertion `left == right` failed: Parse results with PostgreSqlDialect are different from MsSqlDialect left: Err(ParserError("Expected: end of statement, found: (")) right: Err(ParserError("Expected: SELECT, VALUES, or a subquery in the query body, found: some_column"))

Since that's a parser error in both cases (but different message) I split the assertion out into two -- one for semicolons required that asserts the former error, and one for semicolons optional that asserts the other error.

src/parser/mod.rs

- the dialect's original support informs the parser option, but the parser behavior itself should just check it's own options

iffyio · 2025-08-01T11:10:13Z

src/dialect/mod.rs

+    /// When the dialect supports statements without semicolon delimiter, actual keywords aren't parsed as aliases.
+    fn is_table_factor_alias(&self, explicit: bool, kw: &Keyword, _parser: &mut Parser) -> bool {
+        if self.supports_statements_without_semicolon_delimiter() {
+            kw == &Keyword::NoKeyword
+        } else {
+            explicit || self.is_table_alias(kw, _parser)
+        }


Not sure I followed the intent here, would you be able to provide example sqls to illustrate and how they should be parsed with vs without the flag?

Yes, the example sql is from the sqlparse_common test case parse_select_with_table_alias_keyword & parse_invalid_limit_by, such as SELECT a FROM lineitem DECLARE. Commenting out the new logic and running the tests shows the difficulty. My thinking here is that if semicolons are optional, we can forbid keywords as table aliases. Folks who use this option are going to be opting in to a certain level of ambiguity regardless, but having the parser do reasonable things (with test coverage) is my objective here.

Looking at this code again I think this should be checking the actual configured parser option rather than the dialect's capability, similar to the other feedback comment. I will change it.

iffyio · 2025-08-01T11:44:29Z

src/parser/mod.rs

@@ -266,6 +266,22 @@ impl ParserOptions {
        self.unescape = unescape;
        self
    }
+
+    /// Set if semicolon statement delimiters are required.


Ah yeah the with_trailing_commas option was introduced as parser setting originally and later had to be added as a dialect option since it was dialect specific, I figure ideally it would have been present only on the dialect.

My current thinking is be that ideally we have one way to configure the semicolon option since it would be nice to not have two knobs for the flag. In terms of which way to configure, I'm guessing it could make more sense to have the flag as a parser setting - i.e. whether or not semicolon is required is more reflecting the tool that originally processed the input file, vs an aspect of the sql language/dialect itself, so that it can differ in certain contexts for the same dialect.

iffyio · 2025-08-01T11:46:16Z

src/parser/mod.rs

+                | Expr::Cast { .. }
+                | Expr::Convert { .. }
+                | Expr::Subquery(_) => Ok(expr),
+                // todo: how to retstrict to variables?


@aharpervc in the example, is the ambiguity stemming from tsql not supporting IF expressions but only IF statements? If so we could flag that difference in the parser dialect, avoiding this diff. If it does support IF statements then the grammar indeed looks either ambiguous or relying on the newline character to differentiate between the two scenarios?

aharpervc force-pushed the expand-parse-without-semicolons branch 2 times, most recently from 069ce0e to 9be2dad Compare July 16, 2025 19:15

aharpervc marked this pull request as ready for review July 16, 2025 19:17

aharpervc added 4 commits July 21, 2025 12:52

Reduce ambiguity for consecutive statements

7c17355

- at least all of select/insert/update/delete (plus exec) can be added

Disallow keywords as table aliases for parsing statements without sem…

b83c245

…icolons

Enable parsing comma lists without semicolons

53d7930

aharpervc force-pushed the expand-parse-without-semicolons branch from 9be2dad to 53d7930 Compare July 21, 2025 16:52

alamb reviewed Jul 21, 2025

View reviewed changes

aharpervc commented Jul 21, 2025

View reviewed changes

src/parser/mod.rs Outdated Show resolved Hide resolved

aharpervc added 4 commits July 21, 2025 17:10

Add more descriptive docs comment

0026bb7

Only check the parser option

75cb98e

- the dialect's original support informs the parser option, but the parser behavior itself should just check it's own options

Simplify semicolon replacement for input string

bbd32ed

Restore former "end of statement" logic

4bde5b4

aharpervc force-pushed the expand-parse-without-semicolons branch from 9a52518 to 4bde5b4 Compare July 21, 2025 22:00

aharpervc requested a review from alamb July 22, 2025 19:01

iffyio reviewed Aug 1, 2025

View reviewed changes

-    /// Returns true if the dialect supports parsing statements without a semicolon delimiter.
+    /// Returns true if the dialect supports parsing statements without a semicolon delimiter.
+    ///
+    /// If returns true, the following SQL will not parse. If returns `false` the SQL will parse
+    ///
+    /// ```sql
+    /// SELECT 1
+    /// SELECT 2
+    /// ```

Expand parse without semicolons #1949

Are you sure you want to change the base?

Expand parse without semicolons #1949

Conversation

aharpervc commented Jul 16, 2025

Uh oh!

aharpervc commented Jul 16, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iffyio Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

iffyio Aug 1, 2025 •

edited

Loading