Django是一款经典的Python Web开发框架,是最受欢迎的Python开源项目之一。不同于Flask框架,Django是高度集成的,可以帮助开发者快速搭建一个Web项目。 从上周开始,我们一起进入Djaong项目的源码解析,加深对django的理解,熟练掌握python web开发。在上篇文章中我们首先采用概读法,了解django项目的创建和启动过程。这篇文章我们同样使用概读法,了解diango中http协议的处理流程。这个流程也是客户端的请求如何在django中传递和处理,再返回客户端的过程。本文大概分下面几个部分:

  • tcp && http && wsgi 协议分层
  • 中间件(middleware)链表
  • URL路由搜索
  • 请求(request)和response响应

tcp && http && wsgi 协议分层

接上一篇,我们知道django在runserver命令中启动http服务入口:

1
2
3
4
5
6
7
8
# runserver.py
from django.core.servers.basehttp import (
    WSGIServer, get_internal_wsgi_application, run,
)
...
handler = get_internal_wsgi_application()
run(self.addr, int(self.port), handler,
    ipv6=self.use_ipv6, threading=threading, server_cls=WSGIServer)

http-server和wsgi-server的实现都在basehttp模块。下面是改模块的结构图:

从结构图总我们不难发现tcp,http和wsgi的结构层级主要关系大概如下图:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
+------------+      +--------------------+
| WSGIServer <------+ WSGIRequestHandler |
+-----+------+      +------+-------------+
      |                    |
      |                    |
+-----v------+      +------v-------------+
| HttpServer <------+ HttpRequestHandler |
+-----+------+      +------+-------------+
      |                    |
      |                    |
+-----v------+      +------v-------------+
|  TCPServer <------+  TCPRequestHandler |
+------------+      +--------------------+
  • 每一层服务的实现,都有对应的RequestHandler处理对应协议的请求
  • 客户端的请求从底层的tcp,封装成上层的http请求,再到wsgi请求
  • 上层类都继承自下层类

比较特别的是多出来的ServerHandler和WSGIRequestHandler的关系:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
class WSGIRequestHandler(BaseHTTPRequestHandler):

    def handle(self):
        """Handle a single HTTP request"""
        ...
        
        handler = ServerHandler(
            self.rfile, self.wfile, self.get_stderr(), self.get_environ(),
            multithread=False,
        )
        handler.request_handler = self      # backpointer for logging
        handler.run(self.server.get_app())

这里把http请求,委托给支持wsgi的application,也就是django的application。

关于http和wsgi的详细内容,可以参考之前的文章 wsgiref 源码阅读,这里不再详细介绍。

了解三层协议后,我们在看看application的动态载入:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def get_internal_wsgi_application():
    """
    Load and return the WSGI application as configured by the user in
    ``settings.WSGI_APPLICATION``. With the default ``startproject`` layout,
    this will be the ``application`` object in ``projectname/wsgi.py``.

    This function, and the ``WSGI_APPLICATION`` setting itself, are only useful
    for Django's internal server (runserver); external WSGI servers should just
    be configured to point to the correct application object directly.

    If settings.WSGI_APPLICATION is not set (is ``None``), return
    whatever ``django.core.wsgi.get_wsgi_application`` returns.
    """
    ...
    return get_wsgi_application()

application实际上是一个WSGIHandler对象实例:

1
2
3
4
5
6
7
8
9
def get_wsgi_application():
    """
    The public interface to Django's WSGI support. Return a WSGI callable.

    Avoids making django.core.handlers.WSGIHandler a public API, in case the
    internal WSGI implementation changes or moves in the future.
    """
    django.setup(set_prefix=False)
    return WSGIHandler()
  • 需要注意的是WSGIHandler和WSGIRequestHandler不是同一个类

这个WSGIHandle对象执行了wsgi的实现,接收wsgi-environ,在start_response中处理http状态码和http头,返回http响应。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class WSGIHandler(base.BaseHandler):
    request_class = WSGIRequest

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.load_middleware()

    def __call__(self, environ, start_response):
        set_script_prefix(get_script_name(environ))
        signals.request_started.send(sender=self.__class__, environ=environ)
        request = self.request_class(environ)
        response = self.get_response(request)

        response._handler_class = self.__class__

        status = '%d %s' % (response.status_code, response.reason_phrase)
        response_headers = [
            *response.items(),
            *(('Set-Cookie', c.output(header='')) for c in response.cookies.values()),
        ]
        start_response(status, response_headers)
        ...
        return response

另外一个非常重要的点是WSGIHandle对象在初始化时候进行了load_middleware,加载了django的中间件。

中间件(middleware)链条

在开始介绍中间件(middleware)之前,我们先了解一点点基础知识。装饰器模式是python中非常重要的设计模式,也是中间件的基础。下面是一个装饰器的简单示例:

1
2
3
4
5
6
from decorators import debug, do_twice

@debug
@do_twice
def greet(name):
    print(f"Hello {name}")

程序的运行结果:

1
2
3
4
5
>>> greet("Eva")
Calling greet('Eva')
Hello Eva
Hello Eva
'greet' returned None

我们可以看到整个过程是从外至内的,首先是debug装饰器生效,告知开始运行greet函数;然后是do_twice装饰器生效,调用了两次greet函数;最里层的greet目标函数最后执行并返回。取消掉@字符这个语法糖,上面函数的调用过程大概是这样的:

1
debug(do_twice(greet))(name)
  • debug和do_twice的参数和返回值都是函数
  • name是最后一个函数的参数

利用装饰器,我们可以在不修改目标函数的情况下,给函数增加各种额外功能。在django中这是中间件(middleware)的工作。下面是一个最简单的函数式中间件:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def simple_middleware(get_response):
    # One-time configuration and initialization.

    def middleware(request):
        # Code to be executed for each request before
        # the view (and later middleware) are called.

        response = get_response(request)

        # Code to be executed for each request/response after
        # the view is called.

        return response

    return middleware
  • simple_middleware是标准的装饰器实现
  • simple_middleware包装WSGIHandler的get_response方法
  • get_response函数处理request然后返回response
  • 2个注释部分预留了可以扩展的空间: 在目标函数执行之前对request进行处理和在目标函数执行之后对response进行处理

实际上所有的middlew都继承自MiddlewareMixin,在其中使用模版模式,定义了process_requestprocess_response两个待子类扩展的方法,进一步明确了中间件的处理位置:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
class MiddlewareMixin:

    def __call__(self, request):
        response = None
        if hasattr(self, 'process_request'):
            response = self.process_request(request)
        response = response or self.get_response(request)
        if hasattr(self, 'process_response'):
            response = self.process_response(request, response)
        return response

MiddlewareMixin是一个以类方式实现的装饰器

默认的CommonMiddleware中示例了扩展出一个通用的中间件(middleware): 继承自MiddlewareMixin,扩展process_request和process_response方法:

1
2
3
4
5
6
7
class CommonMiddleware(MiddlewareMixin):
    
    def process_request(self, request):
        pass
        
    def process_response(self, request, response):
        pass

下面是process_request的全部代码:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
def process_request(self, request):
    """
    Check for denied User-Agents and rewrite the URL based on
    settings.APPEND_SLASH and settings.PREPEND_WWW
    """

    # Check for denied User-Agents
    user_agent = request.META.get('HTTP_USER_AGENT')
    if user_agent is not None:
        for user_agent_regex in settings.DISALLOWED_USER_AGENTS:
            if user_agent_regex.search(user_agent):
                raise PermissionDenied('Forbidden user agent')

    # Check for a redirect based on settings.PREPEND_WWW
    host = request.get_host()
    must_prepend = settings.PREPEND_WWW and host and not host.startswith('www.')
    redirect_url = ('%s://www.%s' % (request.scheme, host)) if must_prepend else ''

    # Check if a slash should be appended
    if self.should_redirect_with_slash(request):
        path = self.get_full_path_with_slash(request)
    else:
        path = request.get_full_path()

    # Return a redirect if necessary
    if redirect_url or path != request.get_full_path():
        redirect_url += path
        return self.response_redirect_class(redirect_url)
  • 检查http头中的HTTP_USER_AGENT是否被禁止
  • 检查请求的域名是否需要跳转

下面是process_response的全部代码:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
def process_response(self, request, response):
    """
    When the status code of the response is 404, it may redirect to a path
    with an appended slash if should_redirect_with_slash() returns True.
    """
    # If the given URL is "Not Found", then check if we should redirect to
    # a path with a slash appended.
    if response.status_code == 404 and self.should_redirect_with_slash(request):
        return self.response_redirect_class(self.get_full_path_with_slash(request))

    # Add the Content-Length header to non-streaming responses if not
    # already set.
    if not response.streaming and not response.has_header('Content-Length'):
        response.headers['Content-Length'] = str(len(response.content))

    return response
  • 确保给http响应增加Content-Length的http头

了解单个中间件(middleware)实现后,我们继续看所有中间件协作原理。默认情况下会配置下面这些预制的中间件:

1
2
3
4
5
6
7
8
9
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]

前面介绍过这些中间件列表是由load_middleware函数处理:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def load_middleware(self, is_async=False):
    ...
    handler = self._get_response
    ...
    for middleware_path in reversed(settings.MIDDLEWARE):
        middleware = import_string(middleware_path)
        ...
        mw_instance = middleware(handler)
        ...
        handler = mw_instance
    ...
    self._middleware_chain = handler
  • 通过get_response获取到view的函数
  • 逐一加载中间件类并实例化中间件对象
  • 将所有的中间件对象形成一个链表

中间件可以形成一个链表是因为MiddlewareMixin的结构,每个middleware包括了一个指向后续中间件的引用:

1
2
3
4
5
6
7
8
9
class MiddlewareMixin:
    ...
    
    def __init__(self, get_response):
        ...
        # 后续处理的指针
        self.get_response = get_response
        ...
        super().__init__()

每个wsgi-application的请求响应,都需要调用_middleware_chain处理:

1
2
3
4
5
6
7
def get_response(self, request):
    """Return an HttpResponse object for the given HttpRequest."""
    # Setup default url resolver for this thread
    set_urlconf(settings.ROOT_URLCONF)
    response = self._middleware_chain(request)
    ...
    return response

在werkzeug中,我们介绍过「洋葱模型」:

onion-model

可以看到django的middleware也是类似的逻辑,中间件层层包裹形成一个链,可以在穿透和返回过程中对请求和响应各进行一次额外处理。

url-router路由搜索

http的URL和view函数对应,是通过urlpatterns配置。比如下面将api-app的根路径api/和view层的index函数映射起来:

1
2
3
4
5
6
urlpatterns = [
    path('', views.index, name='index'),
]

def index(request):
    return HttpResponse("Hello, Python 2. You're at the index.")

django提供了一个些实现,可以根据请求的URL查找到业务View的函数。前面的load_middleware中就是通过_get_response开始查找业务View:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
def _get_response(self, request):
    """
    Resolve and call the view, then apply view, exception, and
    template_response middleware. This method is everything that happens
    inside the request/response middleware.
    """
    callback, callback_args, callback_kwargs = self.resolve_request(request)
    ...
    response = callback(request, *callback_args, **callback_kwargs)
    ...
    return response

resolver对象根据request.path_info也就是URL查找View函数:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
def resolve_request(self, request):
    """
    Retrieve/set the urlconf for the request. Return the view resolved,
    with its args and kwargs.
    """
    # Work out the resolver.
    ...
    resolver = get_resolver()
    # Resolve the view, and assign the match object back to the request.
    resolver_match = resolver.resolve(request.path_info)
    request.resolver_match = resolver_match
    return resolver_match

resolver对象通过下面的方式创建:

1
2
3
4
5
6
7
8
def get_resolver(urlconf=None):
    if urlconf is None:
        urlconf = settings.ROOT_URLCONF
    return _get_cached_resolver(urlconf)

@functools.lru_cache(maxsize=None)
def _get_cached_resolver(urlconf=None):
    return URLResolver(RegexPattern(r'^/'), urlconf)
  • 这里使用了lru_cache来缓存URLResolver对象提高效率

resolve主要使用url的匹配规则查找配置的url_patterns:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
def resolve(self, path):
    path = str(path)  # path may be a reverse_lazy object
    tried = []
    match = self.pattern.match(path)
    if match:
        new_path, args, kwargs = match
        for pattern in self.url_patterns:
            try:
                sub_match = pattern.resolve(new_path)
            except Resolver404 as e:
                ...
            return ResolverMatch(
                            sub_match.func,
                            sub_match_args,
                            sub_match_dict,
                            sub_match.url_name,
                            [self.app_name] + sub_match.app_names,
                            [self.namespace] + sub_match.namespaces,
                            self._join_route(current_route, sub_match.route),
                            tried,
                        )

url_patterns可以进行二级递归加载:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
 @cached_property
def urlconf_module(self):
    if isinstance(self.urlconf_name, str):
        return import_module(self.urlconf_name)
    else:
        return self.urlconf_name

@cached_property
def url_patterns(self):
    # urlconf_module might be a valid set of patterns, so we default to it
    patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module)
    try:
        iter(patterns)
    except TypeError as e:
        ...
    return patterns

下面代码示例了在project中,导入app的url,形成一个project-app的二级树状结构。这样的结构设计和flask的蓝图非常类似,对于组织大型web项目非常有用。

1
2
3
4
5
6
7
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('api/', include('api.urls')),
]

我们可以发现django的router实现实际上是一种懒加载模式,有请求对象才开始初始化。ResolverMatch的算法实现,我们以后再行详细介绍。

请求(request)和response响应

我们再观测django的view函数:

1
2
def index(request):
    return HttpResponse("Hello, Python 2. You're at the index.")
  • 请求是一个request对象
  • 返回是一个HttpResponse对象

在WSGIHandler中有定义request的类是WSGIRequest:

1
2
class WSGIHandler(base.BaseHandler):
    request_class = WSGIRequest

request类的继承关系如下:

1
2
3
+-------------+   +-------------+   +--------+
| WSGIRequest +---> HttpRequest +---> object |
+-------------+   +-------------+   +--------+

对于request我们跟踪一下http协议的header,query和body如何传递到request对象, 都是通过env对象传递。比如query对象是这样封装到wsgi-env上:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
class WSGIRequestHandler(BaseHTTPRequestHandler):
    
    def get_environ(self):
        if '?' in self.path:
            path,query = self.path.split('?',1)
        else:
            path,query = self.path,''

        env['PATH_INFO'] = urllib.parse.unquote(path, 'iso-8859-1')
        env['QUERY_STRING'] = query

wsgi的env会传递给wsgi-application:

1
2
3
4
5
6
class BaseHandler:
    
    def run(self, application):
        self.setup_environ()
        self.result = application(self.environ, self.start_response)
        self.finish_response()

在WSGIRequest中从wsgi-env中读取query放到GET属性上:

1
2
3
4
5
@cached_property
def GET(self):
    # The WSGI spec says 'QUERY_STRING' may be absent.
    raw_query_string = get_bytes_from_wsgi(self.environ, 'QUERY_STRING', '')
    return QueryDict(raw_query_string, encoding=self._encoding)

比如body是在stdin上,也是封装在env的 wsgi.input key上:

1
2
3
4
5
6
7
class BaseHandler:

    def setup_environ(self):
        """Set up the environment for one request"""
        ...
        env['wsgi.input']        = self.get_stdin()
        ...

在WSGIRequest中使用LimitedStream封装一下:

1
2
3
class WSGIRequest(HttpRequest):
    def __init__(self, environ):
        self._stream = LimitedStream(self.environ['wsgi.input'], content_length)

这样body就从这个流中读取:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
@property
    def body(self):
        if not hasattr(self, '_body'):
            ...
            try:
                self._body = self._stream.read()
            except OSError as e:
                raise UnreadablePostError(*e.args) from e
            self._stream = BytesIO(self._body)
        return self._bod

http的响应处理则不太一样,view返回的是普通的HttpResponse对象,在wsgi框架中将其转换写入到stdout中:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class SimpleHandler(BaseHandler):
    def _write(self,data):
        result = self.stdout.write(data)
        ...
        while True:
            data = data[result:]
            if not data:
                break
            result = self.stdout.write(data)
...

def finish_response(self):
    """Send any iterable data, then close self and the iterable

    Subclasses intended for use in asynchronous servers will
    want to redefine this method, such that it sets up callbacks
    in the event loop to iterate over the data, and to call
    'self.close()' once the response is finished.
    """
    try:
        if not self.result_is_file() or not self.sendfile():
            for data in self.result:
                self.write(data)
            self.finish_content()
    except:
        ...
    else:
        # We only call close() when no exception is raised, because it
        # will set status, result, headers, and environ fields to None.
        # See bpo-29183 for more details.
        self.close()

具体到http模块下的HttpRequest和HttpResponse的实现,就是比较存粹的python对象,我认为这也是sansio的实现。

sansio.Request是non-IO理念的HTTP request实现,希望IO和逻辑像三明治(sandwich)一样,分层in-IO/业务逻辑/out-IO三层。这种方式实现的Request对象比较抽象,不涉及io和aio具体实现,比较通用,而且可以 快速测试 。如果wsgi的实现,推荐使用上层的 werkzeug.wrappers.Request

sansio

小结

通过对django的wsgi模块进行分析,我们梳理了tcp,http和wsgi的三层关系,了解了wsgi-application和wsgi-handler之间的委托方式。然后分析了中间件(middleware)的构造,加载方式和多个中间件如何构成中间件链表并对请求进行额外的处理。分析了如何通过http的URL路由到对应的view函数,进行业务响应。最后对django的request和response对象如何和http协议进行结合进行了简单分析。

完成上面四个步骤的分析后,我们可以知道远程的http请求如何传递到业务view函数并进行响应返回的整个流程。

参考链接