First-time database design: am I overengineering?

20,795

Solution 1

Some more answers to your questions:

1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!

2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.

That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.

For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.

All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.

Solution 2

You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.

What you can do is in the Departments table, add CityId and DivisionId.

Besides that, I think everything is fine...

Solution 3

The only changes I would make are:
1- Change your VARCHAR to NVARCHAR, if you might be going international, you may want unicode.

2- Change your int id's to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Have GUID Ids makes this significantly easier.

3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy such that you can support n-levels of depth. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client that has more layers may be easily "stuffable" into this model.

4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.

Solution 4

No. It looks like you're designing at a good level of detail.

I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).

Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!

In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.

Solution 5

By the way, it's worth noting that if you're generating CSVs already and want to load them into a mySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . Mysqlimport is also worth looking into, and is a command-line tool that's basically a nice wrapper around load data infile.

Share:
20,795

Related videos on Youtube

bob esponja
Author by

bob esponja

Updated on June 21, 2020

Comments

  • bob esponja
    bob esponja almost 4 years

    Background

    I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.

    My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.

    My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.

    We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division)

    The schema I have come up with is the following:

    1. From the client perspective:
      • Clients is the main table
      • Clients are linked to the department they work for
        • Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
        • Departments are linked to the division of a company
      • Divisions are linked to the parent company
    2. From the classes perspective:
      • Sessions is the main table
        • A teacher is linked to each session
        • A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
        • Sessions are grouped into "packs" of an arbitrary size
      • Each packs is assigned to a client

    I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd form. I then plugged it into MySQL Workbench and it made it all pretty for me:
    (Click here for full-sized graphic)

    alt text
    (source: maian.org)

    Example queries I'll be running

    • Which clients with credit still left are inactive (those without a class scheduled in the future)
    • What is the attendance rate per client/department/division (measured by the status id in each session)
    • How many classes has a teacher had in a month
    • Flag clients who have low attendance rate
    • Custom reports for HR departments with attendance rates of people in their division

    Question(s)

    • Is this overengineered or am I headed the right way?
    • Will the need to join multiple tables for most queries result in a big performance hit?
    • I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?

    Thanks for your time

    • Adam Crossland
      Adam Crossland about 14 years
      Dear first year CS student: please keep using StackOverflow. Your question is interesting, well-written and helpful. In other words, you are in the top 1% of question askers.
    • Mark Schultheiss
      Mark Schultheiss about 14 years
      Can a Division contain other Divisions? IF that is the case a "has" table might be used to link the Division back to the Division it is contained by.
    • bob esponja
      bob esponja about 14 years
      Thanks for the kind comments :) Mark I'll have to go over the documentation for this project again, but I don't think we identified that case. Thanks for pointing it out.
    • james
      james about 9 years
      I don't like your primary key naming convensions. table divisions has column named divisionid. Don't you find that redundant? Just name it id. also your table names including _has_: i would remove that and just name it for example cities_departments. your DATETIME columns should be of type TIMESTAMP unless they're user-input values. I think it's a good idea to have the cities and countries tables. you may run into trouble limiting tables to a single status. consider using an INT and perform bitwise comparisons on it- so you can hold more meaning there
    • Jedi
      Jedi almost 8 years
      @binnyb There's a lot of argument about using id as the name of the primary key which people should consider before deciding .
    • Admin
      Admin about 6 years
      @bob esponja; Just my 5 cents: Do yourselfs a favour and use PostgreSQL instead of mySQL. All else is fine. One other advice: don't use float for money (use decimal instead), and don't use mySQL to store monetary data - it's the highway to hell. The DB-design seems to be fine otherwise.
    • Spivonious
      Spivonious about 5 years
      If no users will be inserting/updating/deleting data, forget about normalization and structure your tables to match the reports. You'll gain speed and eliminate joins. You're essentially looking to make a "Data Warehouse", where normalization usually just gets in the way.
  • Martin
    Martin about 14 years
    enums are evil. Everytime you need to extend the enum, you have to rebuild your table - which is OK until your table becomes many GB in size.
  • Jacob G
    Jacob G about 14 years
    I think he needs the mapping tables if he wants to re-use a department definition across different divisions or cities.
  • Reverend Gonzo
    Reverend Gonzo about 14 years
    Yes, I'll agree ..... but it sounded like a department could only be in one city/divison. If not, then what he had was definitely correct.
  • bob esponja
    bob esponja about 14 years
    I have a wiki article I wrote with a "spec" in the office, I'll have to read it again, but Jacob G is correct, IIRC there are some departments which span divisions. One HR department of ACME parent for both ACME healthcare and ACME bodycare. If I can simplify it though I certainly will, thanks for the suggestion.
  • bob esponja
    bob esponja about 14 years
    If I have understood you correctly it's not quite the case. The "courses" are just groups of subsequent sessions. It's not a traditional semester-based system. I can't think of anything else that could be added to the client domain, do you have any example? Also I was worried I had gone overboard already with the complexity, glad its not the case :) Thanks for your input.
  • bob esponja
    bob esponja about 14 years
    1- Thanks, I've been having trouble with diacritics and UTF8 for which I was going to post another question. Maybe this is the issue. 2- I've read some other questions here on SO with lots of conflicting opinions on the matter, I'll be doing more reading on the subject. 3- I'll speak this over with my dad again, looking at the "spec" I wrote and see if this is something we should look into. --Cont'd next comment
  • bob esponja
    bob esponja about 14 years
    4- I didn't go into it in the main question for brevity: the status on client is whether they're active (have sessions remaining) or inactive (no sessions remaining). By more clarity, do you mean a more descriptive name for the col? E.g. enrolment_status? Thanks for your input.
  • Jacob G
    Jacob G about 14 years
    re #4- In addition to your clearer name, if there are only two states, active/inactive, then why not just make it a bit column?
  • bob esponja
    bob esponja about 14 years
    Thanks for the input and suggestion Chris, I was worried I'd be creating an overly complex monster. Martin, the statuses are pretty well-defined and static: basically 0-Complete class, 1-Class cancelled, 2-Didn't turn up. I think these three cover any possible outcome of a class. Is it still a bad idea to use enums in this case?
  • Chris Dennett
    Chris Dennett about 14 years
    This seems perfect for an enum, in my mind. All possible outcomes are satisfied ahead of time. An int is fine too which you can represent by an enum or static ints in your app. Doesn't really matter :) Enums are nicer to look at if you edit your database using some tool.
  • bob esponja
    bob esponja about 14 years
    Thanks for pointing that out, when I did the diagram I was only really paying attention to the datatypes of the main columns and probably would have missed this later on. Bit column it is.
  • Martin
    Martin about 14 years
    enums can be problematic (perhaps evil is too strong a word) when you have large tables that must be online 24x7 and the enum needs to be changed. Given that you are repopulating the tables from scratch - don't worry about it. Given a small enough data set, you might as well just use strings.
  • bob esponja
    bob esponja about 14 years
    1. Do you mean adding all those columns to the client table? I think that would break the normalisation, and also make it hard to keep consistent, I'm not sure I understood correctly though. 2. Packs are sequential, only the most recent pack can have credit outstanding, so no need to track multiple packs. Would you still recommend storing it in client table in this case? 3. This seems like it will be very helpful figuring out the structure of client companies, I'll look into it thanks.
  • bob esponja
    bob esponja about 14 years
    4. I'll have to check the number of clients and sessions we expect to have over the next year, but its seems feasible to me for the sessions table to reach that many rows in a year or so. I'll look into reporting software, it hadn't occurred to me. 5. It seems that's the situation I've arrived to by accident; the web app will be our "transaction database" and this project our "repoting database" :) Thanks for your input.
  • bob esponja
    bob esponja about 14 years
    1. Thanks, that's reassuring! 2 & 3. I still don't know how indexes work, it's something I have planned to read up on. If we ever have the "problem" of reaching a million records there will probably be a budget to hire experienced developers :P Thanks for the insight into different db roles that exist, it's all new to me and very interesting to know. I'll look into snapshots as what you describe is basically the end goal of the project.
  • bob esponja
    bob esponja about 14 years
    The countries table is for if/when we have clients that operate in more than one country and have a different HR department for each one. That way we can create reports with data from the country the department we're dealing with operates in. Same for departments and cities, I think we have a client who has separate HR depts. for the two cities they have main offices in. Or at least that was the reasoning, I'll sit down and rethink it to see if they really are necessary. Hadn't thought of CompanyType, I'll find out if that's something we need to track.
  • bob esponja
    bob esponja about 14 years
    RE: depts table, my original thought track was to use it as actual departments, with the department name being the type. It hadn't occurred to me to just have department types, which seems more logical. About knowing which department and where someone belongs to, I had thought that having the department linked to a city and division (which is linked to a company) would have worked. Was I wrong? For collapsing Cities into Divisions, some Divisions span multiple cities, and I think maybe even countries. I'll look into it again. Thanks for your input.
  • bob esponja
    bob esponja about 14 years
    Try as I might, it keeps taking ovecalculates big O of helloworld.c, optimizes The cities and countries tables just kind of spawned themselves when I was following the steps to get a 3NF database. I guess the advantage they offer is coherence for city/country names. Like if we get a client in Munich and for some reason whoever enters a new student into the scheduling system decides to call it München instead of Munich like for the previous students. Also we might need to list departments by city, I'll have to check. Thanks.
  • dburges
    dburges about 14 years
    Disagree about the GUIDs, shudder. They can be horrible for performance. Don't use them unless you need to replciate.
  • dburges
    dburges about 14 years
    Optimizing in the design phase of a database is critical! It is not premature optimizing as databases are significantly harder to refacotr when they have million of records.
  • Hans Westerbeek
    Hans Westerbeek about 14 years
    I did not say he should not stress-test his design :)
  • Jacob G
    Jacob G about 14 years
    The performance only comes into play when you're talking 10's of millions of rows in a table. If you have that type of structure, then you can mitigate that with sequential guids and creative indexing. Otherwise, "performance" is a red herring when discounting GUIDs.
  • Will
    Will about 14 years
    1. Yes adding "Company ID, Department Description, Division Description, Department Type ID, Division Type ID" columns to the client table. Client belongs to one company, a distinct department type (IT/Ops/Admin/etc.) within a company and a distinct division type (Sales/HR/Marketing lines of business). 2. I just think Credit is associated with a client or company and not with the Pack of sessions. This is a business decision you can make.
  • Will
    Will about 14 years
    Larry also mentioned combining Company and Country. I totally agree and go back to the point regarding D&B reference. I'd use a SiteID or something unique to allow for multiple locations of the same company and then link the Departments to one of the unique SiteIDs.
  • Morgan Tocker
    Morgan Tocker almost 14 years
    RE: (1) nvarchar is an alias for varchar in MySQL. There's no point. Re: (2) As others have indicated, GUIDs are not a good choice for MySQL (assuming InnoDB). The argument that it is only a performance problem for "10's of millions/rows" is misleading.. if you reach that point accidentally it's not exactly easy to retrofit a smaller key. InnoDB clusters by the primary key and uses it as an internal identifier for secondary key indexes.
  • Jacob G
    Jacob G almost 14 years
    @Morgan, RE: GUIDs - Granted my experience is with SQL Server rather than MySQL or InnoDB, but I've had highly performant tables with GUID PKs in the hundreds of millions of rows. SQL Server allows for Sequential GUIDs, which can be efficiently indexed.
  • Morgan Tocker
    Morgan Tocker almost 14 years
    Yes, with other database problems GUID PK performance is likely not nearly as bad as InnoDB. It's not just about whether it makes the insertion sequential, it matters for size of the secondary indexes.
  • pojo-guy
    pojo-guy about 7 years
    If you understand tables, the fundamentals of indexes are pretty easy. Conceptually, an index can be (and often is) implemented as a table with a very few columns whose contents are copied from the main table, and a reference back to the main table, whose rows are keot sorted for rapid accessibility. B+Tree is the most common index arrangement, but index optimizations are where the big players have their differentiating technologies so it gets murky if you try to apply the analogy too deeply.