Getting most used words from a column of strings in SQL
Solution 1
As Blogbeard said, the query you provided does not work with SQL Server. Here is one way to count the most used words. It is based on a function, DelimitedSplitN4K, written by Jeff Moden and improved by members of the SQL Server Central community.
WITH E1(N) AS (
    SELECT 1 FROM (VALUES
        (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
    ) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT TOP 50
    x.Item,
    COUNT(*)
FROM Posts p
CROSS APPLY (
    SELECT
        ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item = LTRIM(RTRIM(SUBSTRING(p.Title, l.N1, l.L1)))
    FROM (
        SELECT s.N1,
            L1 = ISNULL(NULLIF(CHARINDEX(' ',p.Title,s.N1),0)-s.N1,4000)
        FROM(
            SELECT 1 UNION ALL
            SELECT t.N+1
            FROM(
                SELECT TOP (ISNULL(DATALENGTH(p.Title)/2,0))
                    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                FROM E4
            ) t(N)
            WHERE SUBSTRING(p.Title ,t.N,1) = ' '
        ) s(N1)
    ) l(N1, L1)
) x
WHERE x.item <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
Since creating functions is not allowed on data.se, I inlined the splitter logic in the query above. Here is the function definition if you're interested:
CREATE FUNCTION [dbo].[DelimitedSplitN4K](
    @pString NVARCHAR(4000),
    @pDelimiter NCHAR(1)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
cteTally(N) AS(
    SELECT TOP (ISNULL(DATALENGTH(@pString)/2,0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (
    SELECT 1 UNION ALL
    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(
    SELECT s.N1,
        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,4000)
    FROM cteStart s
)
SELECT
    ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
    Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
And here is how you would use it:
SELECT TOP 50
    x.Item,
    COUNT(*)
FROM Posts p
CROSS APPLY dbo.DelimitedSplitN4K(p.Title, ' ') x
WHERE LTRIM(RTRIM(x.Item)) <> ''
GROUP BY x.Item
ORDER BY COUNT(*) DESC
The result:
Item
-------- -------
to 3812411
in 3331522
a 2543636
How 1770915
the 1534298
with 1341632
of 1297468
and 1166664
on 970554
from 964449
for 886007
not 835979
is 704724
using 703007
I 633838
- 632441
an 548450
when 449169
file 409717
how 358745
data 335271
do 323854
can 310298
get 305922
or 266317
error 263563
use 258408
value 254392
it 251254
my 238902
function 235832
by 231025
Android 228308
as 216654
array 209157
working 207445
does 207274
Is 205613
multiple 203336
that 197826
Why 196979
into 196591
after 192056
string 189053
PHP 187018
one 182360
class 179965
if 179590
text 174878
table 169393
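To see why the queries above discard empty items, it can help to run the splitter on a single literal. This is only a minimal sketch: it assumes dbo.DelimitedSplitN4K has been created in your own database (which is not possible on data.se), and the sample title is made up for illustration.

```sql
-- Assumes dbo.DelimitedSplitN4K (defined above) exists in the current database.
-- The sample title is hypothetical; note the two consecutive spaces before 'split'.
SELECT ItemNumber, Item
FROM dbo.DelimitedSplitN4K(N'How to  split', N' ')
ORDER BY ItemNumber;

-- ItemNumber  Item
-- 1           How
-- 2           to
-- 3           (empty string)
-- 4           split
```

The empty third item comes from the doubled space, which is why the counting queries keep only rows where the item is not ''.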
Solution 2
Query solution (No Split Function Required)
PostgreSQL
select word, count(*) from
(
    -- get 1st words
    select split_part(title, ' ', 1) as word
    from posts
    union all
    -- get 2nd words
    select split_part(title, ' ', 2) as word
    from posts
    union all
    -- get 3rd words
    select split_part(title, ' ', 3) as word
    from posts
    -- can do this as many times as the number of words in longest title
) words
where word is not null
    and word NOT IN ('', 'and', 'for', 'of', 'on')
group by word
order by count desc
limit 50;
For a concise version, see: https://dba.stackexchange.com/a/82456/95929
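One way to avoid repeating split_part for every word position is PostgreSQL's regexp_split_to_table, which returns one row per piece. The following is a sketch under that assumption and is not necessarily the approach taken in the linked answer; the aliases t, word and occurrences are my own.

```sql
-- PostgreSQL sketch: split each title on runs of whitespace, one row per word.
-- Table and column names (posts, title) follow the query above.
select t.word, count(*) as occurrences
from posts
cross join lateral regexp_split_to_table(title, '\s+') as t(word)
where t.word <> ''                                   -- drop empty pieces
  and t.word not in ('and', 'for', 'of', 'on')       -- same stop words as above
group by t.word
order by occurrences desc
limit 50;
```

Unlike the union all version, this does not need to know the number of words in the longest title.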
Solution 3
With the STRING_SPLIT function now available (since SQL Server 2016, compatibility level 130), this query becomes much easier:
SELECT TOP 50
    value [word]
    , COUNT(*) [#times]
FROM posts p
CROSS APPLY STRING_SPLIT(p.title, ' ')
GROUP BY value
ORDER BY COUNT(*) DESC
See it in action on the Stack Exchange Data Explorer, where it still runs in under 2 minutes for the current number of posts in the Stack Overflow database. On Stack Overflow em Português it runs without fear of the dreaded timeout.
Results are similar to what you saw in the answer from Felix.
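STRING_SPLIT returns an empty string for each pair of consecutive spaces, and under a case-sensitive collation words such as "How" and "how" are counted separately (as in the results listed under Solution 1). A possible variant, offered as a sketch that has not been run against data.se, filters out empty values and folds case with LOWER. The aliases s, word and occurrences and the stop-word list are illustrative assumptions.

```sql
-- T-SQL sketch (SQL Server 2016+, compatibility level 130 or higher).
SELECT TOP 50
    LOWER(s.value) AS word,
    COUNT(*) AS occurrences
FROM posts p
CROSS APPLY STRING_SPLIT(p.title, ' ') s
WHERE s.value <> ''                                        -- skip empty tokens from doubled spaces
  AND LOWER(s.value) NOT IN ('and', 'for', 'of', 'on')     -- illustrative stop words, borrowed from Solution 2
GROUP BY LOWER(s.value)
ORDER BY COUNT(*) DESC;
```

Under a case-insensitive collation the LOWER calls do not change the grouping; they only normalize how the word is displayed.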
jmac
Updated on July 24, 2022
Comments
-
jmac almost 2 years
So we have this database filled with a bunch of strings, in this case post titles.
What I want to do is:
- Split the string up into words
- Count how many times words appear in strings
- Give me the top 50 words
- Not have this timeout in a data.se query
I tried using the info from this SO question adapted to data.se as follows:
select word, count(*)
from (
    select (case when instr(substr(p.Title, nums.n+1), ' ')
                 then substr(p.Title, nums.n+1)
                 else substr(p.Title, nums.n+1, instr(substr(p.Title, nums.n+1), ' ') - 1)
            end) as word
    from (select ' '||Title as string from Posts p) Posts
    cross join (select 1 as n union all select 2 union all select 10) nums
    where substr(p.Title, nums.n, 1) = ' '
        and substr(p.Title, nums.n, 1) <> ' '
) w
group by word
order by count(*) desc
Unfortunately, this gives me a slew of errors:
'substr' is not a recognized built-in function name.
Incorrect syntax near '|'.
Incorrect syntax near 'nums'.
So given a column of strings in SQL with a variable amount of text in each string, how can I get a list of the most frequently used X words?
-
Tydis over 4 years
Is there a way to modify this to provide a list of phrases rather than individual words? Also, is this compatible with an imported Excel table rather than an SQL table?
-
volkit about 2 years
The question is tagged with sql-server and tsql, hence this answer for Postgres is not a good fit here. You can find other easy Postgres solutions here: stackoverflow.com/questions/5226202/… and here: dba.stackexchange.com/questions/145016/…