Best practice for storing tags in a database?

35,670

Solution 1

Use a many-to-many table to link a TAG record to an IMAGE record:

IMAGE

DROP TABLE IF EXISTS `example`.`image`;
CREATE TABLE  `example`.`image` (
  `image_id` int(10) unsigned NOT NULL auto_increment,
  PRIMARY KEY  (`image_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

TAG

DROP TABLE IF EXISTS `example`.`tag`;
CREATE TABLE  `example`.`tag` (
 `tag_id` int(10) unsigned NOT NULL auto_increment,
 `description` varchar(45) NOT NULL default '',
 PRIMARY KEY  (`tag_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

IMAGE_TAG_MAP

DROP TABLE IF EXISTS `example`.`image_tag_map`;
CREATE TABLE  `example`.`image_tag_map` (
 `image_id` int(10) unsigned NOT NULL default '0',
 `tag_id` int(10) unsigned NOT NULL default '0',
 PRIMARY KEY  (`image_id`,`tag_id`),
 KEY `tag_fk` (`tag_id`),
 CONSTRAINT `image_fk` FOREIGN KEY (`image_id`) REFERENCES `image` (`image_id`),
 CONSTRAINT `tag_fk` FOREIGN KEY (`tag_id`) REFERENCES `tag` (`tag_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Solution 2

You can make a tags table which is just an id and tag with a unique constraint on tag and then photo_tags table which has tag_id and photo_id. Insert a tag into the tags table only if it doesn't already exist.

Then you will be querying by a pk instead of varchar text comparison when doing queries like how many photos are tagged with a certain tag.

Solution 3

In multi tag search query you will have to hit every tag that is requested. Hence image tag set I has to be a superset of the request tag set U.

I >= U

To implement this complex comparison in SQL is a bit of challenge as each of the image has to be qualified individually. Given that tags are unique set per image:

SELECT i.* FROM images AS i WHERE {n} = (
  SELECT COUNT(*) 
  FROM image_tags AS t 
  WHERE t.image_id = i.image_id
    AND t.tag IN ({tag1}, {tag2}, ... {tagn})
)

Schema:

CREATE TABLE images (
  image_id varchar NOT NULL,
  PRIMARY KEY (image_id)
)

CREATE TABLE image_tags (
  image_id varchar NOT NULL,
  tag varchar NOT NULL,
  PRIMARY KEY (image_id, tag)
)

Solution 4

I believe that there is no right or wrong answer here, it all depends on the type of usage/data and wether you give the ability to edit a tag's title or adding other attributes on the tag.

BTW, there is a hybrid solution, that is much like the array of tags, but instead of saving them as texts save the ids of the tags. It will require 2 queries to get the list with the tags, but will allow you to manage the tags more easily.

Let's break it down for a minute, and take those 3 approaches with the 2 main use cases. Assuming N number of image, M number of tags, TN avg number of images per tag and TM avg number of tags per image (for example, we have 5K images (N=5K), 50 tags (M=50), 5 tags on avg for each image (TM = 5) and 20 images per tag (TN = 20)):

  1. Saving the list of tags on the image table
    • Getting list of images with their tags (without a filter by tag) is O(N) = ~5000
    • Filtering the list of images by certain tags, is O(N * TM) = ~25000 (5000 * 5)
  2. Separate tags in to a different table (with a connecting table)
    • Getting list of images with their tags (without a filter by tag) is O(N * log(N * TM) * log(M)) = ~37000 (5000 * log(25000) * log(50))
    • Filtering the list of images by certain tags, is O(log(M) * log(N) * log(N * TM) * TN) = ~552 (log(50) * log(5000) * log(25000) * 20)
  3. If you take the hybrid solution
    • Getting list of images with their tags (without a filter by tag) is s O(N + M * log(M)), with a potential improvement in case M is very small, you can always get all the tags what makes it O(N + M).(but keep in mind that this has the disadvantage of 2 roundtrips to the DB). = ~5084 (5000 + 50 * log(50)) / ~5050
    • Filtering the list of images by certain tags, is O(log(M) + N * TM).(but keep in mind that this has the disadvantage of 2 roundtrips to the DB) = ~25001 (log(50) + 5 * 5000)

* Of course that just putting the number in the complexity formulas is an over simplification of how it works, but should give a sense of the ballpark. * Some DBs has more complex types then just array, that using them will potentially give better results (like a hstore in postgres).

Share:
35,670
Admin
Author by

Admin

Updated on March 23, 2021

Comments

  • Admin
    Admin about 3 years

    I developed a site that uses tags (key words) in order to categorize photographs. Right now, what I have in my MySQL database is a table with the following structure:

    image_id (int)
    tag      (varchar(32))
    

    Every time someone tags an image (if the tag is valid and has enough votes) it's added to the database. I think that this isn't the optimal way of doing things since now that I have 5000+ images with tags, the tags table has over 40000 entries. I fear that this will begin to affect performance (if it's not already affecting it).

    I considered this other structure thinking that it'd be faster to fetch the tags associated to a particular image but then it looks horrible for when I want to get all the tags, or the most popular one for instance:

    image_id (int)
    tags     (text) //comma delimited list of tags for the image
    

    Is there a correct way of doing this or are both ways more or less the same? Thoughts?