Script Construction and Common Errors that can occur

Things to Consider When Writing Queries

A pair of dashes "- -" denotes the beginning of a comment.

The number of rows for an immediate query is limited to 10,000.

If the query output exceeds this limit then submit the query as queued or reduce the size of the area of interest (AOI).

The scope of each query can be altered by changing the area symbol or by adding additional conditions.

The area symbol corresponds to the soil survey area code not the county FIPS code.
e.g. Ste. Genevieve County, Missouri the FIPS code is MO186 and the soil survey area symbol is MO193.
The area symbol for some multi-county surveys or partially mapped county surveys is in the 600s.
The STATSGO area is identified by 'US' and SSURGO maps are identified by 'non-MLRA soil survey areas'. If you do not specify an area symbol, the query will probably timeout or exceed the limit for returned rows.
When running queries against the whole nation, be sure to specify STATSGO or SSURGO; otherwise, you will get both area types and they are not easily distinguished from each other.
To get STATSGO only use area symbol='US'.
To get all SSURGO areas use area symbol<>'US'.

Wild cards can be used in a query to filter the data. Examples are shown in the tables below.

The following link takes you to a list of all the Table and Column names in the SURRGO database: Table And Column Name Report

Examples of parameters that can be applied to filter your data.

All sample queries are written to retrieve data from only one area e.g. 'NE109', so that they will run immediately. The parameter for the area symbol can be changed. See tables below for examples.

The "Applied Parameter" is the term that you can change in the scripts to alter the area of interest (AOI).
The "Data Returned" is the output in the returned report.

Applied Parameter	Data Returned
areasymbol = 'US'	All STATSGO data
areasymbol <> 'US'	All SSURGO data
areasymbol = 'MO017'	SSURGO data: only for Bollinger County, Missouri
areasymbol LIKE 'MO%'	SSURGO data: all of Missouri
areasymbol IN ('MO017', 'MO035')	SSURGO data: only for Bollinger County, Missouri and Carter County, Missouri

Additional parameters can be applied to queries for map units or any other field.

The tables below show different parameters that can be applied to the map unit symbol and the map unit name.

Applied Parameter	Data Returned
AND m.musym IN('3a', '4b', '6c')	This restricts the query to three map units
AND m.musym = '4b'	This restricts the query to just one map unit
AND m.musym LIKE '4*'	This restricts the query to any map unit that starts with 4
AND m.musym LIKE '[!4]'	This restricts the query to all map units that do not start with 4
AND m.muname LIKE 'Adco%'	This restricts the query to any map unit that starts with Adco

Syntax and Suggested Formats for Efficient Queries

IF you use all INNER JOINs, then the conditions can be in the FROM clause.

If you use an OUTER JOIN, the conditions need to be in the WHERE clause.

All text needs to be bounded by single quotes; double quotes will cause an error.

Use Aliases for the following reasons:

Table and column names are big, it is easier to read and write the scripts.
Table and column names are cryptic and not very readable.
More than one table is involved in a query.
A table is called more than once and joined to itself.
A table is called more than once in subsequent queries.
Two or more tables have the same name for a column, e.g. depths and color fields.
Functions are used in the query.
Two or more columns are combined together

You can change the column headers to be more readable by changing the selected field to an alias e.g. l.areaname as Area_Name.

You can query the component data many different ways: all data, just major components, the dominant component, or the dominant condition.

"All data" retrieves all the data for all the components. This can be very complex and not very useful, depending on the data retrieved.
"All major components" does not query any minor components. Again, this can be a very complex data structure.
"Dominant component" finds the component with the greatest percentage and retrieves only that data.
"Dominant condition" finds the condition that is most prevalent and retrieves that data.

An example of a dominant condition is two components with a restrictive layer that have a combined percentage greater than that of the dominant component, which does not have a restrictive layer.

Only "tune" SQL after you have confirmed the code as working correctly.

Ensure repeated SQL statements are written absolutely identically. This promotes efficient reuse: re-parsing can often be avoided for each subsequent use.

Best practices for writing:

all SQL verbs are in upper-case i.e., SELECT;
all words are separated with a single space;
all SQL verbs begin on a new line;
SQL verbs are aligned right or left within the initial verb;
a table alias standard is to set and maintained;

table aliases are used;
if a query involves more than one table, all column names are prefixed with their aliases

Whatever you do, be consistent.

Code the query as simply as possible; i.e., no unnecessary columns are selected, and GROUP BY or ORDER BY are not used unnecessarily.

It is fast or faster to SELECT by actual column name(s) than to use an asterisk. The larger the table, the more likely you will save time.

E.g., use:

     SELECT muname, musym

rather than:

     SELECT * FROM mapnit;

Perform functions instead of comparisons on the data objects referenced in the WHERE clause.

E.g. use:

    SELECT musym, muname, muacres FROM mapunit WHERE muacres > 0;

rather than:

    SELECT musym, muname, muacres FROM mapunit WHERE muacres!= 0;

Do not use a HAVING clause in SELECT statements. It only filters selected rows after all the rows have been returned. Use HAVING only if summary operations that are applied to columns will be restricted by the clause. A WHERE clause may be more efficient.

E.g., use:

    SELECT musym FROM mapunit WHERE muname!= 'Menfro' AND muname!= 'Goss'; GROUP BY muname;

rather than:

    SELECT musym FROM mapunit GROUP BY muname HAVING muname!= 'Menfro' AND muname!= 'Goss';

When writing a subquery (a SELECT statement within the WHERE or HAVING clause of another SQL statement):

Use a correlated subquery when the return is relatively small and/or other criteria are efficient i.e. if the tables within the subquery have efficient indexes. A correlated subquery refers to at least one value from the outer query.
Use a non-correlated subquery when dealing with large tables from which you expect a large return (many rows) and/or if the tables within the subquery do not have efficient indexes. A non-correlated subquery does not refer to the outer query.
Ensure that multiple subqueries are in the most efficient order.
Remember that rewriting a subquery as a join can sometimes increase efficiency.

Minimize the number of table lookups, especially if SELECT is use in a subquery.

When using multiple table joins consider the benefits and costs for each EXISTS command, IN command, and table joins. Depending on your data, one or another command may be faster.

Note: IN is usually the slowest.
Note: When most of the filter criteria are in the subquery, IN may be more efficient; when most of the filter criteria are in the parent-query, EXISTS may be more efficient.

Where possible, use EXISTS rather than DISTINCT.

Where possible, use a non-column expression (put the column on one side of the operator and all other values on the other). Non-column expressions are commonly processed earlier, thereby speeding up the query.

E.g., use:

    WHERE muacres < 1000/(1 + n);

rather than:

    WHERE muacres + (n * muacres) < 1000;

Try not to use OR in a query:

E.g., use:

    Select compname Where comppct_r=50

      UNION ALL

    Select compname Where comppct_r=30

rather than:

    Select compname

    From component Where (comppct_r=50) OR (comppct_r=30)

If possible, do not use a wildcard in the beginning of the matching pattern "%pattern".

Use SQL Joins instead of using subqueries.

Make the table with the least number of rows the driving table. Do this by making it first in the FROM clause.

The OVER clause with an aggregate function is similar to, but more efficient than, a subquery.

E.g., the aggregate function

    AVG(slope_r) over(partition by compname)

produces the same results as the subquery

    (select AVG(slope_r)from component c1 where c1.compname=component.compname)

A significant difference between the OVER clause and subqueries is that subqueries use ORDER BY for the partitioning column(s); the over clause does not and is less restrictive.
When using GROUP BY, everything in the SELECT clause must be either in the GROUP BY list or an aggregate function. An aggregate function with OVER does not require GROUP BY.

Use an INTO #temp table when:

Data collected from different areas of the database need to be join together.
Data tables that are being queried are in multiple branches of a join statement.
Data concatenates due to multiple linkage joins in the data structure.
Data is collected with different filters and then joined together for comparison.
Data in main query is joined to a secondary query.
Data is filtered through several iterations.

Common Errors that can occur

Double quotes on a text value will cause the error "Invalid query: No end of literal found"
Using the wrong join key will cause the error "Invalid query: Invalid column name MUKEY"
Selecting a column that does not exist in the table in the from clause cause the error "Invalid query: Invalid column name MUKEY"
Incorrect syntax will cause the error "Invalid query: Incorrect syntax near the keyword ...
Aggregate function used and select fields are not in the group by clause will cause the error "Invalid query: Column value in in valid in the select list because it is not contained in either an aggregate function or the GROUP BY clause".

Helping People Help the Land

USDA is an equal opportunity provider, employer, and lender.