Postgresql: How do I ensure that indexes are in memory

Solution 1

Those seq scans make it look like you didn't run analyze on the table after importing your data.

http://www.postgresql.org/docs/current/static/sql-analyze.html

During normal operation you don't usually need to schedule VACUUM ANALYZE yourself, because autovacuum kicks in periodically. But it is important after massive writes, such as bulk imports.
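
For example, a minimal sketch using the table name from your question:

    -- Refresh planner statistics (and reclaim dead space) after a bulk import.
    VACUUM ANALYZE test_table;

    -- Or just the statistics, which is what the planner needs for good estimates:
    ANALYZE test_table;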

On a slightly related note, see the reversed-index tip on Pavel's PostgreSQL Tricks site if you ever need to run queries anchored at the end rather than at the beginning, e.g. like '%.com':

http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
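
A minimal sketch of that trick against your table (the index name and the text_pattern_ops choice are my assumptions, not taken from the linked page):

    -- Index the reversed string so a suffix match becomes a prefix match.
    CREATE INDEX test_url_rev_idx ON test_table (reverse(url) text_pattern_ops);

    -- like '%.com' rewritten as a prefix search on the reversed value:
    SELECT * FROM test_table
    WHERE reverse(url) LIKE reverse('.com') || '%'
    LIMIT 20;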


Regarding your actual questions, be wary that some of the suggestions in the post you linked to are dubious at best. Changing the cost of index use is rarely a good idea, and disabling seq scans is downright silly. (Sometimes it is cheaper to seq scan a table than to use an index.)

With that being said:

  1. Postgres primarily caches indexes based on how often they're used, and it will not use an index if the statistics suggest that it shouldn't -- hence the need to analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood that an index stays in memory too, but keep the previous point in mind. (See the query after this list for a quick way to check whether an index is being used and cached.)
  2. and 3. Full text search works fine.
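
For the first point, a rough way to see whether an index is being scanned at all, and how often its pages are found in shared_buffers rather than read from disk, is the standard statistics views (note that this says nothing about the OS file cache):

    SELECT s.indexrelname,
           s.idx_scan,          -- how many times the index has been scanned
           io.idx_blks_read,    -- index blocks read from disk (or OS cache)
           io.idx_blks_hit      -- index blocks already in shared_buffers
    FROM pg_stat_user_indexes s
    JOIN pg_statio_user_indexes io USING (indexrelid)
    WHERE s.relname = 'test_table';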

For further reading on fine-tuning, see the manual and:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Two last notes on your schema:

  1. Last I checked, bigint (bigserial in your case) was slower than plain int. (This was a while ago, so the difference may be negligible on modern, 64-bit servers.) Unless you foresee actually needing more than about 2.1 billion entries, int is plenty and takes less space.
  2. From an implementation standpoint, the only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need to cap the data at that size and are only doing so out of habit, your inserts and updates will run faster without that constraint, as sketched after this list.
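
A hedged sketch of both changes, assuming the names from your question and that nothing else depends on the current types:

    -- Point 2: drop the length check by widening varchar(300) to text.
    ALTER TABLE test_table ALTER COLUMN url TYPE text;

    -- Point 1: plain integer is enough for roughly 2.1 billion ids.
    -- Note: this rewrites the whole table and its indexes, so only do it
    -- if the space/speed difference actually matters to you.
    ALTER TABLE test_table ALTER COLUMN id TYPE integer;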

Solution 2

Unless your encoding or collation is C or POSIX, an ordinary btree index cannot efficiently satisfy an anchored LIKE query. You may have to declare a btree index with the varchar_pattern_ops operator class to benefit.
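
For example (the index name is illustrative):

    -- An extra btree index whose operator class supports anchored LIKE
    -- ('orange%') under a non-C collation.
    CREATE INDEX test_url_pattern_idx ON test_table (url varchar_pattern_ops);

    -- The planner can now turn the prefix pattern into an index range scan:
    EXPLAIN SELECT * FROM test_table WHERE url LIKE 'orange%' LIMIT 20;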

Solution 3

The problem is that you're getting hit with a full table scan for each of those lookups ("index in memory" isn't really an issue). Each time you run one of those queries the database is visiting every single row, which is causing the high disk usage. You might check here for a little more information (especially follow the links to the docs on operator classes and index types). If you follow that advice you should be able to get prefix lookups working fine, i.e. those situations where you're matching something like 'orange%'.

Full text search is nice for more natural text search, like written documents, but it might be more difficult to get it working well for URL searching. There was also this thread in the mailing lists a few months back that might have more domain-specific information for what you're trying to do.
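
If you do want to experiment with full text search despite that, a rough sketch is a GIN index over a tsvector expression; whether the default parser tokenizes URLs in a useful way is something you would have to test on your data:

    -- Expression index using the built-in 'simple' configuration.
    CREATE INDEX test_url_fts_idx
        ON test_table USING gin (to_tsvector('simple', url));

    -- The query must use the same configuration for the index to be usable.
    SELECT * FROM test_table
    WHERE to_tsvector('simple', url) @@ to_tsquery('simple', 'orange')
    LIMIT 20;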

Author by collumbo

Updated on June 13, 2022

Comments

  • collumbo, almost 2 years

    I have been trying out postgres 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server... I am now running it on a 14GB Azure VM. I went up a size when trying to solve the problem described below.

    I am quite new to postgresql by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, my colleagues and I simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.

    Problem description:

    I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:

    id (bigserial)
    url (character varying(300))
    

    I created indexes after importing the data from a few CSV files. Both columns are indexed.... the id is the primary key. The index on the url is a normal btree created using the defaults through pgAdmin.
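
    In DDL terms the table is roughly equivalent to the following (I created everything through pgAdmin, so the exact statements may differ):

    CREATE TABLE test_table (
        id  bigserial PRIMARY KEY,   -- its index shows up as test_pk below
        url character varying(300)
    );

    -- Default btree index on url, created with pgAdmin's defaults.
    CREATE INDEX test_url_idx ON test_table (url);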

    When I ran:

    SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;
    

    ... The total size is 5980MB

    The individual sizes of the 2 indexes in question are as follows; I got them by running:

     # SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE 
      reltype=0 ORDER BY relpages DESC LIMIT 10;
    
    
                 relname      |  mb  | reltype
    --------------------------+------+--------
     test_url_idx             | 3684 |       0
     test_pk                  | 2161 |       0
    

    There are other indexes on smaller tables, but they are tiny (< 5MB), so I ignored them here.
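
    (For reference, the built-in size functions report roughly the same numbers more directly; relpages is only an estimate maintained by VACUUM/ANALYZE.)

    SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
    FROM pg_class
    WHERE relname IN ('test_url_idx', 'test_pk');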

    The trouble is the speed (or lack of it) when querying test_table on the url column, particularly when the search uses a wildcard, e.g.:

    select * from test_table where url like 'orange%' limit 20;
    

    ...would take anything from 20-40 seconds to run.

    Running explain analyze on the above gives the following:

    # explain analyze select * from test_table where
       url like 'orange%' limit 20;
    
              QUERY PLAN
    -----------------------------------------------------------------    
     Limit  (cost=0.00..4787.96 rows=20 width=57)
            (actual time=0.304..1898.583 rows=20 loops=1)
       ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
            (actual time=0.302..1898.542 rows=20 loops=1)
            Filter: ((url)::text ~~ 'orange%'::text)
            Rows Removed by Filter: 210286
     Total runtime: 1898.650 ms
    (5 rows)
    

    Taking another example, this time with the wildcard between 'american' and '.com':

    # explain  select * from test_table where url 
       like 'american%.com' limit 50;
    
    QUERY PLAN
    -------------------------------------------------------
     Limit  (cost=0.00..11969.90 rows=50 width=57)
      ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
         Filter: ((url)::text ~~ 'american%.com'::text)
        (3 rows)
    
    
    # explain analyze select * from test_table where url 
        like 'american%.com' limit 50;
    
    QUERY PLAN
    -----------------------------------------------------
     Limit  (cost=0.00..11969.90 rows=50 width=57)
            (actual time=83.470..3035.696 rows=50 loops=1)
       ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
            (actual time=83.467..3035.614 rows=50 loops=1)
            Filter: ((url)::text ~~ 'american%.com'::text)
            Rows Removed by Filter: 276142
     Total runtime: 3035.774 ms
    (5 rows)
    

    I then went from a 7GB to a 14GB server. Query speeds were no better.

    Observations on the server

    • I can see that Memory usage never really goes beyond 2MB.
    • Disk reads go off the charts when running a query using a LIKE statement.
    • Query speed is perfectly fine when matching against the id (primary key)

    The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.

    Changes to conf:

    shared_buffers = 512MB  
    
    checkpoint_segments = 10 
    

    (I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)

    cpu_index_tuple_cost = 0.0005       
    effective_cache_size = 10GB    # recommendation in the blog post was 2GB...
    

    On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:

    CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)

    Memory: 1.5/14.0GB (11%)

    More details on Memory:

    • In use: 1.4GB
    • Available: 12.5GB
    • Committed: 1.9/16.1 GB
    • Cached: 835MB
    • Paged Pool: 95.2MB
    • Non-paged pool: 71.2 MB

    Questions

    1. How can I ensure an index will sit in memory (provided it doesn't get too big for memory)? Is it just configuration tweaking I need here?
    2. Is implementing my own search index (e.g. Lucene) a better option here?
    3. Are the full-text indexing features in postgres going to improve performance dramatically, even if I can solve the index in memory issue?

    Thanks for reading.