
A Complete Guide: Mastering the Principles and Practices of Good System Design



I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are new to the industry. Another is the Twitter-optimized “you’re a terrible engineer if you ever store booleans in a database” clever trick1. Even good system design advice can be kind of bad. I love Designing Data-Intensive Applications, but I don’t think it’s particularly useful for most system design problems engineers will run into.

What is system design? In my view, if software design is how you assemble lines of code, system design is how you assemble services. The primitives of software design are variables, functions, classes, and so on. The primitives of system design are app servers, databases, caches, queues, event buses, proxies, and so on.

This post is my attempt to write down, in broad strokes, everything I know about good system design. A lot of the concrete judgment calls do come down to experience, which I can’t convey in this post. But I’m trying to write down what I can.

Recognizing good design

What does good system design look like? I’ve written before that it looks underwhelming. In practice, it looks like nothing going wrong for a long time. You can tell that you’re in the presence of good design if you have thoughts like “huh, this ended up being easier than I expected”, or “I never have to think about this part of the system, it’s fine”. Paradoxically, good design is self-effacing: bad design is often more impressive than good. I’m always suspicious of impressive-looking systems. If a system has distributed-consensus mechanisms, many different forms of event-driven communication, CQRS, and other clever tricks, I wonder if there’s some fundamental bad decision that’s being compensated for (or if the system is just straightforwardly over-designed).

I’m often alone on this. Engineers look at complex systems with many interesting parts and think “wow, a lot of system design is happening here!” In fact, a complex system usually reflects an absence of good design. I say “usually” because sometimes you do need complex systems. I’ve worked on many systems that earned their complexity. However, a complex system that works always evolves from a simple system that works. Beginning from scratch with a complex system is a really bad idea.

State and statelessness

The hard part about software design is state. If you’re storing any kind of information for any amount of time, you have a lot of tricky decisions to make about how you save, store and serve it. If you’re not storing information2, your app is “stateless”. As a non-trivial example, GitHub has an internal API that takes a PDF file and returns a HTML rendering of it. That’s a real stateless service. Anything that writes to a database is stateful.

You should try and minimize the amount of stateful components in any system. (In a sense this is trivially true, because you should try to minimize the amount of all components in a system, but stateful components are particularly dangerous.) The reason you should do this is that stateful components can get into a bad state. Our stateless PDF-rendering service will safely run forever, as long as you’re doing broadly sensible things: e.g. running it in a restartable container so that if anything goes wrong it can be automatically killed and restored to working order. A stateful service can’t be automatically repaired like this. If your database gets a bad entry in it (for instance, an entry with a format that triggers a crash in your application), you have to manually go in and fix it up. If your database runs out of room, you have to figure out some way to prune unneeded data or expand it.

What this means in practice is having one service that knows about the state - i.e. it talks to a database - and other services that do stateless things. Avoid having five different services all write to the same table. Instead, have four of them send API requests (or emit events) to the first service, and keep the writing logic in that one service. If you can, it’s worth doing this for the read logic as well, although I’m less absolutist about this. It’s sometimes better for services to do a quick read of the user_sessions table than to make a 2x slower HTTP request to an internal sessions service.
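
A minimal sketch of what that looks like in practice (the service URL, table, and function names here are all invented): the sessions service is the only code that writes to user_sessions, and every other service goes through its API.

```python
# Sketch: one service owns the writes to user_sessions; everyone else calls its API.
import requests  # any HTTP client works

SESSIONS_SERVICE_URL = "http://sessions.internal"  # hypothetical internal host

def create_session(user_id: str, token: str) -> dict:
    """Called from other services: delegate the write to the owning service."""
    resp = requests.post(
        f"{SESSIONS_SERVICE_URL}/sessions",
        json={"user_id": user_id, "token": token},
        timeout=2,
    )
    resp.raise_for_status()
    return resp.json()

# Inside the sessions service itself, the write logic lives in exactly one place.
# `db` is a hypothetical database handle with psycopg2-style placeholders.
def handle_create_session(db, user_id: str, token: str) -> dict:
    db.execute(
        "INSERT INTO user_sessions (user_id, token) VALUES (%s, %s)",
        (user_id, token),
    )
    return {"user_id": user_id, "token": token}
```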

Databases

Since managing state is the most important part of system design, the most important component is usually where that state lives: the database. I’ve spent most of my time working with SQL databases (MySQL and PostgreSQL), so that’s what I’m going to talk about.

Schemas and indexes

If you need to store something in a database, the first thing to do is define a table with the schema you need. Schema design should be flexible, because once you have thousands or millions of records, it can be an enormous pain to change the schema. However, if you make it too flexible (e.g. by sticking everything in a “value” JSON column, or using “keys” and “values” tables to track arbitrary data) you load a ton of complexity into the application code (and likely buy some very awkward performance constraints). Drawing the line here is a judgment call and depends on specifics, but in general I aim to have my tables be human-readable: you should be able to go through the database schema and get a rough idea of what the application is storing and why.
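
To make the contrast concrete, here’s a small illustrative sketch (using SQLite for convenience; the table and column names are made up): one schema you can read at a glance, and one that hides everything in a JSON blob.

```python
# A "human-readable" schema versus the too-flexible alternative.
import sqlite3

conn = sqlite3.connect(":memory:")

# Readable: you can tell what the application stores just by looking at it.
conn.execute("""
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        plan TEXT NOT NULL,            -- e.g. 'free', 'pro'
        started_at TEXT NOT NULL,      -- ISO-8601 timestamp
        cancelled_at TEXT              -- NULL while active
    )
""")

# Too flexible: everything hides in a JSON blob, the schema tells you nothing,
# and every query has to dig into application-defined keys.
conn.execute("""
    CREATE TABLE records (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        value TEXT NOT NULL            -- arbitrary JSON
    )
""")
```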

If you expect your table to ever be more than a few rows, you should put indexes on it. Try to make your indexes match the most common queries you’re sending (e.g. if you query by email and type, create an index with those two fields). Indexes work like nested dictionaries, so make sure to put the highest-cardinality fields first (otherwise each index lookup will have to scan all users of that type to find the one with the right email). Don’t index on every single thing you can think of, since each index adds write overhead.
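
For example, a sketch of matching an index to the email-and-type query mentioned above (again SQLite, with invented names); EXPLAIN QUERY PLAN is a cheap way to confirm the index is actually being used.

```python
# Make the index match the query you actually send, higher-cardinality field first.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, type TEXT)")

# The query we expect to run constantly:
#   SELECT id FROM users WHERE email = ? AND type = ?
# so index the same fields, with email (high cardinality) before type.
conn.execute("CREATE INDEX idx_users_email_type ON users (email, type)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ? AND type = ?",
    ("a@example.com", "admin"),
).fetchall()
print(plan)  # should mention idx_users_email_type
```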

Bottlenecks

Accessing the database is often the bottleneck in high-traffic applications. This is true even when the compute side of things is relatively inefficient (e.g. Ruby on Rails running on a preforking server like Unicorn). That’s because complex applications need to make a lot of database calls - hundreds and hundreds for every single request, often sequentially (because you don’t know if you need to check whether a user is part of an organization until after you’ve confirmed they’re not abusive, and so on). How can you avoid getting bottlenecked?

When querying the database, query the database. It’s almost always more efficient to get the database to do the work than to do it yourself. For instance, if you need data from multiple tables, JOIN them instead of making separate queries and stitching them together in-memory. Particularly if you’re using an ORM, beware accidentally making queries in an inner loop. That’s an easy way to turn one select id, name from table into a select id from table plus a hundred select name from table where id = ? queries.
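
Here’s roughly what that looks like (db is a stand-in for an sqlite3-style database handle whose execute returns rows, and the tables are invented): the first version makes one query per member, the second lets the database do the join.

```python
# The accidental-N+1 pattern versus pushing the work into one query.

def team_member_names_slow(db, team_id):
    # One query for the ids...
    ids = [row[0] for row in db.execute(
        "SELECT user_id FROM memberships WHERE team_id = ?", (team_id,))]
    # ...then one query per member: the classic inner-loop N+1.
    return [db.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
            for i in ids]

def team_member_names_fast(db, team_id):
    # Let the database do the join: one round trip instead of N+1.
    rows = db.execute(
        """SELECT users.name
             FROM users
             JOIN memberships ON memberships.user_id = users.id
            WHERE memberships.team_id = ?""",
        (team_id,),
    )
    return [row[0] for row in rows]
```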

Every so often you do want to break queries apart. It doesn’t happen often, but I’ve run into queries that were ugly enough that it was easier on the database to split them up than to try to run them as a single query. I’m sure it’s always possible to construct indexes and hints such that the database can do it better, but the occasional tactical query-split is a tool worth having in your toolbox.

Send as many read queries as you can to database replicas. A typical database setup will have one write node and a bunch of read-replicas. The more you can avoid reading from the write node, the better - that write node is already busy enough doing all the writes. The exception is when you really, really can’t tolerate any replication lag (since read-replicas are always running at least a handful of ms behind the write node). But in most cases replication lag can be worked around with simple tricks: for instance, when you update a record but need to use it right after, you can fill in the updated details in-memory instead of immediately re-reading after a write.
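
A small sketch of that trick, with hypothetical names: after the UPDATE, patch the copy you already have in memory rather than re-reading it from a replica that may not have caught up yet.

```python
# Working around replication lag: don't read-your-write from a lagging replica.
# `primary` is a hypothetical DB-API connection to the write node.

def rename_project(primary, project: dict, new_name: str) -> dict:
    primary.execute(
        "UPDATE projects SET name = ? WHERE id = ?", (new_name, project["id"])
    )
    # Build the updated record in memory instead of immediately re-reading it.
    return {**project, "name": new_name}
```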

Beware spikes of queries (particularly write queries, and particularly transactions). Once a database gets overloaded, it gets slow, which makes it more overloaded. Transactions and writes are good at overloading databases, because they require a lot of database work for each query. If you’re designing a service that might generate massive query spikes (e.g. some kind of bulk-import API), consider throttling your queries.
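
One simple way to throttle a bulk import is just to batch and pace the writes. A rough sketch (the batch size and pause are arbitrary numbers, db is a hypothetical DB-API handle, and rows are (account_id, payload) tuples):

```python
# Smooth out a write spike from a bulk-import endpoint by batching and pacing.
import time

BATCH_SIZE = 500
PAUSE_SECONDS = 0.2  # small gap between batches so the database can breathe

def bulk_insert(db, rows):
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        db.executemany(
            "INSERT INTO import_rows (account_id, payload) VALUES (?, ?)", batch
        )
        time.sleep(PAUSE_SECONDS)
```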

Slow operations, fast operations

A service has to do some things fast. If a user is interacting with something (say, an API or a web page), they should see a response within a few hundred ms3. But a service has to do other things that are slow. Some operations just take a long time (converting a very large PDF to HTML, for instance). The general pattern for this is splitting out the minimum amount of work needed to do something useful for the user and doing the rest of the work in the background. In the PDF-to-HTML example, you might render the first page to HTML immediately and queue up the rest in a background job.

What’s a background job? It’s worth answering this in detail, because “background jobs” are a core system design primitive. Every tech company will have some kind of system for running background jobs. There will be two main components: a collection of queues, e.g. in Redis, and a job runner service that will pick up items from the queues and execute them. You enqueue a background job by putting an item like {job_name, params} on the queue. It’s also possible to schedule background jobs to run at a set time (which is useful for periodic cleanups or summary rollups). Background jobs should be your first choice for slow operations, because they’re typically such a well-trodden path.
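
A bare-bones sketch of that pattern using the redis-py client (the queue name and job names are invented; a real job runner would add retries, timeouts, and error handling):

```python
# The "{job_name, params} on a Redis queue" pattern, stripped to its essentials.
import json
import redis

r = redis.Redis()

def enqueue(job_name, params):
    r.lpush("background_jobs", json.dumps({"job_name": job_name, "params": params}))

def worker_loop(handlers):
    # `handlers` maps job names to functions, e.g. {"render_pdf": render_pdf}.
    while True:
        _, raw = r.brpop("background_jobs")   # blocks until a job is available
        job = json.loads(raw)
        handlers[job["job_name"]](**job["params"])

enqueue("render_pdf", {"document_id": 123})
```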

Sometimes you want to roll your own queue system. For instance, if you want to enqueue a job to run in a month, you probably shouldn’t put an item on the Redis queue. Redis persistence is typically not guaranteed over that period of time (and even if it is, you likely want to be able to query for those far-future enqueued jobs in a way that would be tricky with the Redis job queue). In this case, I typically create a database table for the pending operation with columns for each param plus a scheduled_at column. I then use a daily job to check for these items with scheduled_at <= today, and either delete them or mark them as complete once the job has finished.
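
Here’s a sketch of that database-backed queue, using SQLite for illustration; the table, columns, and the downgrade example are all invented:

```python
# A "run this in a month" queue: a table with the job's params plus scheduled_at,
# swept once a day.
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending_downgrades (
        id INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        scheduled_at TEXT NOT NULL,     -- ISO date
        completed_at TEXT               -- NULL until the daily sweep runs it
    )
""")

def schedule_downgrade(account_id, days_from_now):
    run_on = (date.today() + timedelta(days=days_from_now)).isoformat()
    conn.execute(
        "INSERT INTO pending_downgrades (account_id, scheduled_at) VALUES (?, ?)",
        (account_id, run_on),
    )

def daily_sweep(downgrade):
    today = date.today().isoformat()
    due = conn.execute(
        "SELECT id, account_id FROM pending_downgrades "
        "WHERE scheduled_at <= ? AND completed_at IS NULL", (today,)
    ).fetchall()
    for row_id, account_id in due:
        downgrade(account_id)
        conn.execute(
            "UPDATE pending_downgrades SET completed_at = ? WHERE id = ?",
            (today, row_id),
        )
```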

Caching

Sometimes an operation is slow because it needs to do an expensive (i.e. slow) task that’s the same between users. For instance, if you’re calculating how much to charge a user in a billing service, you might need to do an API call to look up the current prices. If you’re charging users per-use (like OpenAI does per-token), that could (a) be unacceptably slow and (b) cause a lot of traffic for whatever service is serving the prices. The classic solution here is caching: only looking up the prices every five minutes, and storing the value in the meantime. It’s easiest to cache in-memory, but using some fast external key-value store like Redis or Memcached is also popular (since it means you can share one cache across a bunch of app servers).
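
A minimal in-memory version of that five-minute cache might look like this (fetch_prices_from_api is a stand-in for whatever expensive call you’re protecting):

```python
# Only look the prices up every five minutes; serve the cached copy in between.
import time

CACHE_TTL_SECONDS = 300
_cache = {"prices": None, "fetched_at": 0.0}

def get_prices(fetch_prices_from_api):
    now = time.monotonic()
    if _cache["prices"] is None or now - _cache["fetched_at"] > CACHE_TTL_SECONDS:
        _cache["prices"] = fetch_prices_from_api()   # the expensive call
        _cache["fetched_at"] = now
    return _cache["prices"]
```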

The typical pattern is that junior engineers learn about caching and want to cache everything, while senior engineers want to cache as little as possible. Why is that? It comes down to the first point I made about the danger of statefulness. A cache is a source of state. It can get weird data in it, or get out-of-sync with the actual truth, or cause mysterious bugs by serving stale data, and so on. You should never cache something without first making a serious effort to speed it up. For instance, it’s silly to cache an expensive SQL query that isn’t covered by a database index. You should just add the database index!

I use caching a lot. One useful caching trick to have in the toolbox is using a scheduled job and a document storage like S3 or Azure Blob Storage as a large-scale persistent cache. If you need to cache the result of a really expensive operation (say, a weekly usage report for a large customer), you might not be able to fit the result in Redis or Memcached. Instead, stick a timestamped blob of the results in your document storage and serve the file directly from there. Like the database-backed long-term queue I mentioned above, this is an example of using the caching idea without using a specific cache technology.
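
Sketched with boto3 (the bucket name and key layout are invented, and this assumes AWS credentials are already configured):

```python
# Object storage as a large persistent cache: write the expensive result once
# under a timestamped key, then serve it from there.
import json
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = "example-usage-reports"   # hypothetical bucket

def store_weekly_report(customer_id, report: dict) -> str:
    key = f"reports/{customer_id}/{date.today().isoformat()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(report).encode("utf-8"))
    return key

def load_report(key: str) -> dict:
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(obj["Body"].read())
```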

Events

As well as some kind of caching infrastructure and background job system, tech companies will typically have an event hub. The most common implementation of this is Kafka. An event hub is just a queue - like the one for background jobs - but instead of putting “run this job with these params” on the queue, you put “this thing happened” on the queue. One classic example is firing off a “new account created” event for each new account, and then having multiple services consume that event and take some action: a “send a welcome email” service, a “scan for abuse” service, a “set up per-account infrastructure” service, and so on.
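
The producer side of that pattern, sketched with the kafka-python client (the topic name and event shape are invented); each consumer service reads the topic independently, and the producer never needs to know who they are:

```python
# Fire a "this thing happened" event onto the event hub.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def account_created(account_id, email):
    # The welcome-email service, the abuse scanner, and the per-account
    # infrastructure service each consume this topic on their own schedule.
    producer.send("account.created", {"account_id": account_id, "email": email})
    producer.flush()
```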

You shouldn’t overuse events. Much of the time it’s better to just have one service make an API request to another service: all the logs are in the same place, it’s easier to reason about, and you can immediately see what the other service responded with. Events are good for when the code sending the event doesn’t necessarily care what the consumers do with the event, or when the events are high-volume and not particularly time-sensitive (e.g. abuse scanning on each new Twitter post).

Pushing and pulling

When you need data to flow from one place to a lot of other places, there are two options. The simplest is to pull. This is how most websites work: you have a server that owns some data, and when a user wants it they make a request (via their browser) to the server to pull that data down to them. The problem here is that users might end up pulling down the same data over and over - e.g. refreshing their email inbox to see if they have any new emails, which will pull down and reload the entire web application instead of just the data about the emails.

The alternative is to push. Instead of allowing users to ask for the data, you allow them to register as clients, and then when the data changes, the server pushes the data down to each client. This is how GMail works: you don’t have to refresh the page to get new emails, because they’ll just appear when they arrive.

If we’re talking about background services instead of users with web browsers, it’s easy to see why pushing can be a good idea. Even in a very large system, you might only have a hundred or so services that need the same data. For data that doesn’t change much, it’s much easier to make a hundred HTTP requests (or RPC, or whatever) whenever the data changes than to serve up the same data a thousand times a second.

Suppose you did need to serve up-to-date data to a million clients (like GMail does). Should those clients be pushing or pulling? It depends. Either way, you won’t be able to run it all from a single server, so you’ll need to farm it out to other components of the system. If you’re pushing, that will likely mean sticking each push on an event queue and having a horde of event processors each pulling from the queue and sending out your pushes. If you’re pulling, that will mean standing up a bunch (say, a hundred) of fast4 read-replica cache servers that will sit in front of your main application and handle all the read traffic5.

Hot paths

When you’re designing a system, there are lots of different ways users can interact with it or data can flow through it. It can get a bit overwhelming. The trick is to mainly focus on the “hot paths”: the part of the system that is most critically important, and the part of the system that is going to handle the most data. For instance, in a metered billing system, those pieces might be the part that decides whether or not a customer gets charged, and the part that needs to hook into all user actions on the platform to identify how much to charge.

Hot paths are important because they have fewer possible solutions than other design areas. There are a thousand ways you can build a billing settings page and they’ll all mainly work. But there might be only a handful of ways that you can sensibly consume the firehose of user actions. Hot paths also go wrong more spectacularly. You have to really screw up a settings page to take down the entire product, but any code you write that’s triggered on all user actions can easily cause huge problems.

Logging and metrics

How do you know if you’ve got problems? One thing I’ve learned from my most paranoid colleagues is to log aggressively during unhappy paths. If you’re writing a function that checks a bunch of conditions to see if a user-facing endpoint should respond 422, you should log the condition that was hit. If you’re writing billing code, you should log every decision made (e.g. “we’re not billing for this event because of X”). Many engineers don’t do this because it adds a bunch of logging boilerplate and makes it hard to write beautifully elegant code, but you should do it anyway. You’ll be happy you did when an important customer is complaining that they’re getting a 422 - even if that customer did something wrong, you still need to figure out what they did wrong for them.
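
For instance, a sketch of what that logging looks like (the validation rules and field names here are made up):

```python
# Log the specific reason an unhappy path was taken, so a 422 can be explained later.
import logging

logger = logging.getLogger("uploads")

def validate_upload(user, file_size_bytes, content_type):
    if content_type != "application/pdf":
        logger.warning("rejecting upload (422): bad content type %r for user %s",
                       content_type, user.id)
        return False
    if file_size_bytes > 50 * 1024 * 1024:
        logger.warning("rejecting upload (422): file too large (%d bytes) for user %s",
                       file_size_bytes, user.id)
        return False
    return True
```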

You should also have basic observability into the operational parts of the system. That means CPU/memory on the hosts or containers, queue sizes, average time per-request or per-job, and so on. For user-facing metrics like time per-request, you also need to watch the p95 and p99 (i.e. how slow your slowest requests are). Even one or two very slow requests are scary, because they’re disproportionately from your largest and most important users. If you’re just looking at averages, it’s easy to miss the fact that some users are finding your service unusable.

Killswitches, retries, and failing gracefully

I wrote a whole post about killswitches that I won’t repeat here, but the gist is that you should think carefully about what happens when the system fails badly.

Retries are not a magic bullet. You need to make sure you’re not putting extra load on other services by blindly retrying failed requests. If you can, put high-volume API calls inside a “circuit breaker”: if you get too many 5xx responses in a row, stop sending requests for a while to let the service recover. You also need to make sure you’re not retrying write events that may or may not have succeeded (for instance, if you send a “bill this user” request and get back a 5xx, you don’t know if the user has been billed or not). The classic solution to this is to use an “idempotency key”, which is a special UUID in the request that the other service uses to avoid re-running old requests: every time they do something, they save the idempotency key, and if they get another request with the same key, they silently ignore it.
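
A rough sketch of both ideas together (the thresholds, internal URL, and header name are invented; check what the service you’re actually calling expects): the breaker stops hammering a failing service, and the idempotency key is generated once per logical charge and reused on any retry so duplicates can be dropped.

```python
# Crude circuit breaker plus an idempotency key on a write request.
import time
import uuid
import requests

class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown_seconds=30):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Stay open until the cooldown has passed, then let a request probe through.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def bill_user(user_id, amount_cents, idempotency_key):
    if not breaker.allow():
        raise RuntimeError("billing service circuit is open; try again later")
    resp = requests.post(
        "http://billing.internal/charges",            # hypothetical internal service
        json={"user_id": user_id, "amount_cents": amount_cents},
        headers={"Idempotency-Key": str(idempotency_key)},
        timeout=5,
    )
    breaker.record(resp.status_code < 500)
    return resp

# Generate the key once, reuse it if you retry this same logical charge.
bill_user("u_123", 999, uuid.uuid4())
```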

It’s also important to decide what happens when part of your system fails. For instance, say you have some rate limiting code that checks a Redis bucket to see if a user has made too many requests in the current window. What happens when that Redis bucket is unavailable? You have two options: fail open and let the request through, or fail closed and block the request with a 429.

Whether you should fail open or closed depends on the specific feature. In my view, a rate limiting system should almost always fail open. That means that a problem with the rate limiting code isn’t necessarily a big user-facing incident. However, auth should (obviously) always fail closed: it’s better to deny a user access to their own data than to give a user access to some other user’s data. There are a lot of cases where it’s not clear what the right behavior is. It’s often a difficult tradeoff.
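
As a concrete example of failing open, here’s a sketch of a fixed-window rate limiter backed by Redis (the limit and window are arbitrary): if Redis is unavailable, it lets the request through rather than turning a rate-limiter outage into a full outage.

```python
# Fixed-window rate limiter that fails open when Redis is unavailable.
import redis

r = redis.Redis()
LIMIT = 100          # requests per window
WINDOW_SECONDS = 60

def allow_request(user_id) -> bool:
    key = f"ratelimit:{user_id}"
    try:
        count = r.incr(key)
        if count == 1:
            r.expire(key, WINDOW_SECONDS)   # first hit in this window starts the clock
        return count <= LIMIT
    except redis.RedisError:
        # Fail open: a rate-limiter problem shouldn't take the product down.
        # (An auth check should do the opposite and fail closed.)
        return True
```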

Final thoughts

There are some topics I’m deliberately not covering here. For instance, whether or when to split your monolith out into different services, when to use containers or VMs, tracing, good API design. Partly this is because I don’t think it matters that much (in my experience, monoliths are fine), or because I think it’s too obvious to talk about (you should use tracing), or because I just don’t have the time (API design is complicated).

The main point I’m trying to make is what I said at the start of this post: good system design is not about clever tricks, it’s about knowing how to use boring, well-tested components in the right place. I’m not a plumber, but I imagine good plumbing is similar: if you’re doing something too exciting, you’re probably going to end up with crap all over yourself.

Especially at large tech companies, where these components already exist off the shelf (i.e. your company already has some kind of event bus, caching service, etc), good system design is going to look like nothing. There are very, very few areas where you want to do the kind of system design you could talk about at a conference. They do exist! I have seen hand-rolled data structures make features possible that wouldn’t have been possible otherwise. But I’ve only seen that happen once or twice in ten years. I see boring system design every single day.

edit: this post was discussed on Hacker News with lots of good comments. I was amused by the comments that said “why even mention ‘don’t read your writes’, who would do that” right next to the comments that said “hmm, it seems way too fiddly to not read your writes”.

If you liked this post, consider subscribing to email updates about my new posts, or sharing it on Hacker News.

June 21, 2025 │ Tags: good engineers, software design


