Web Server Design Problem Statement
"To provide a modular mechanism for securely delivering the required data to the AI client. The mechanism is scalable, delivers high performance, and provides a compelling user experience."
Goals
The goals are scalability, performance, and a compelling user experience. Listed in order of importance:
- User experience
- Scalability
- Performance
Criteria for Choosing a Mechanism
- Mechanism must be scalable
- Mechanism must work across subnets
- Mechanism must be tunable
- Mechanism must provide a logging feature
- Mechanism must transport data securely
- Mechanism must allocate resources appropriately
Analysis of Problem Statement
1. Modularity
Modularity here means the ability to easily use different mechanisms for data transfer. Whatever mechanism is chosen initially, it shouldn't preclude adding other mechanisms.
2. Mechanisms for Data Transfer
There are numerous mechanisms available for data transfer. What are the advantages and disadvantages of each?
- Point-to-Point
- HTTP:
- Apache Webserver
- Advantages - well tested, broadly used, high amount of configurability. Highly scalable - has a history of use in large production environments. Large install base means the software will not fade from existence in near future, and is already well tuned for use.
- Disadvantages - potentially more flexible than necessary (meaning bulkier). It has far more configuration options than we need. We would have to consider some sort of CGI or server-side application to select which manifest to serve -- could be Python, Perl, C, shell, etc.
- CherryPy Webserver
- Summary: CherryPy is actually a two-part program. The first part is a simplified web server. The second part (and its "selling point" as a product, not necessarily as a solution to the problem being presented here) is an object-oriented Python development toolkit. Because it is split this way, the Python features can be hooked into web servers other than the one provided with CherryPy, including Apache.
- How fast is CherryPy?
- Advantages: Easier to configure. The performance range may be within the limits of what we desire (the CherryPy website states an estimated 400-500 responses/sec under "reasonable configurations"). Additionally, we could theoretically configure Apache as the web server and take advantage of CherryPy as an object-oriented framework, to get the "best of both worlds".
- Disadvantages: Handling of concurrent connections may not be as robust as in other solutions such as Apache. CherryPy's own website recommends using Apache as the front-end web server in enterprise environments and for static content delivery.
- SQUID Proxy
- Does it make sense to use Apache for static data and CherryPy for dynamic data? Apache with a CherryPy module -- is something like this what serves pkg.os.org? Why can't we use Apache for all files? Do we have enough dynamic data to justify Apache with CherryPy? Further, should Apache just be used as a cache? Could we mirror/cache more like an IPS repo?
- Secure Copy
- FUSE supported filesystems (sshfs, davfs, etc.)
- FTP (including TFTP, SFTP, Anonymous FTP)
- Point-to-Many (traditional multicast)
- Linux and Apple are using multicast. We would have to change our client/server mechanism in order to use this protocol. It could be interesting in the future.
- Many-to-Many
- BitTorrent - Use of BitTorrent requires two components: (1) a central tracker, which manages peer connections, and (2) a system of peers. The client code is available in the SUNWtransmission package. Tracker code is written in Python (among other languages) and could be ported. Another minor requirement is delivery of a .torrent file to all the peers. This detail could present a "chicken/egg" problem, as the .torrent file will need to be sent to the clients somehow.
- Advantages: Distributes network load across multiple systems, de-stressing the AI server.
- Disadvantages: Distributes network load across multiple systems, stressing the clients. Offline clients cannot participate. Must implement method for delivering initial .torrent file. For maximum effectiveness, each client must maintain a copy of every AI image to be used on the network.
- Gnutella - Gnutella is a protocol for searching for files on an extended network. While the search is peer-to-peer, the file transfer is generally done via HTTP. As such, it is not a method of file transfer in and of itself.
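The HTTP option above needs some server-side logic to select which manifest to serve. A minimal sketch of that selection step follows. The rule table, the criteria names (arch, mac), and the manifest file names are all hypothetical, chosen only for illustration; the actual criteria and matching rules are still to be designed.

```python
# Hypothetical rule table: the first matching rule wins. Criteria names and
# manifest file names are assumptions made for this illustration only.
RULES = [
    ({"arch": "sparc", "mac": "00:14:4f:aa:bb:cc"}, "sparc_db_server.xml"),
    ({"arch": "sparc"}, "sparc_default.xml"),
    ({"arch": "i86pc"}, "x86_default.xml"),
]
DEFAULT_MANIFEST = "default.xml"


def select_manifest(criteria, rules=RULES, default=DEFAULT_MANIFEST):
    """Return the manifest name for a client, given its criteria dict."""
    for wanted, manifest in rules:
        # A rule matches when every key it names has the same value in the
        # client's criteria; extra keys sent by the client are ignored.
        if all(criteria.get(key) == value for key, value in wanted.items()):
            return manifest
    return default
```

The same function could sit behind a CGI script, a CherryPy handler, or an Apache module, which keeps the selection logic independent of the web server that is finally chosen.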
3. Required Data
- the initial boot file
- X86: 135KB PXE GRUB over TFTP
- SPARC: 1.1MB for WANBoot. WANboot-cgi currently used. (WANboot could be changed)
- Boot Archive
- X86: 57MB file over TFTP acquired by GRUB
- SPARC: 170MB file over HTTP acquired by WANBoot
- Any required configuration files
- X86: <1KB file - menu.lst
- SPARC: <1KB file - wanboot.conf
- AI Manifest
- All: <10KB file -- dynamically chosen
- zlib files
- X86: 79MB (combined solaris.zlib and solarismisc.zlib)
- SPARC: 82.9MB (combined solaris.zlib and solarismisc.zlib)
- Client Criteria
- All: <10KB sent in an HTTP/1.1 request (well within the 2MB maximum POST size; note that this limit comes from server configuration rather than from HTTP/1.1 itself)
- Future data we might look at delivering
Note: Architectural Consideration - What order of downloading files would provide the most flexibility? There is no choice regarding the boot archive, boot files, and configuration files. They must be downloaded first. We do have a choice with the others.
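To get a feel for the load these files put on the server, the sizes listed above can be added up per architecture. The sketch below does that arithmetic. It assumes 1MB = 1000KB, uses the upper bounds for the "<1KB" and "<10KB" entries, and treats the network as a single shared link with no protocol overhead, so the result is only a lower bound on transfer time.

```python
# Sizes in KB, taken from the Required Data list (1 MB treated as 1000 KB;
# the "<1KB" and "<10KB" entries use their upper bounds).
PAYLOAD_KB = {
    "x86": {
        "boot_file": 135,
        "boot_archive": 57_000,
        "config": 1,
        "manifest": 10,
        "zlib": 79_000,
    },
    "sparc": {
        "boot_file": 1_100,
        "boot_archive": 170_000,
        "config": 1,
        "manifest": 10,
        "zlib": 82_900,
    },
}


def per_client_kb(arch):
    """Total KB one client of the given architecture downloads."""
    return sum(PAYLOAD_KB[arch].values())


def min_transfer_seconds(arch, clients, link_mbit_per_s):
    """Lower bound on the time to serve all clients over one shared link."""
    total_kbit = per_client_kb(arch) * clients * 8
    return total_kbit / (link_mbit_per_s * 1000)
```

Under these assumptions an x86 client pulls about 136MB and a SPARC client about 254MB, and 100 x86 clients on a 1000 Mbit/s link need at least about 109 seconds.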
4. Networking
- Single vs. Multiple Servers - It is quite possible that some data will come from one server and other data from another server. The choice of transport is not affected by whether the data comes from one server or from multiple servers.
- Single vs. Multiple Subnets - This could affect scalability and performance. When you talk about multiple subnets, there is a router involved. Some protocols are allowed by routers by default. Some other protocols are not allowed and need to be configured. So not all mechanisms may work across subnets.
- Single vs. Multiple Clients - clients will access the mechanism for the data, and the mechanism on the server needs to be smart enough to add more resources to handle an increased client load. When requests go down, the resource usage on the server side should go down. (Falls under scalability and performance).
- For these network configurations, the action of the mechanism will be the same no matter which one is used:
- DHCP vs. statically configured IP addresses
- Multicast DNS
- Unicast DNS
- Wired vs. Wireless Ethernet
- Infiniband
5. Scalability
- One client and one server
- Serving the same image to many systems (HPC, for example)
- Serving many images to many systems (co-located data center, for example)
- Tunable - The mechanism should be tunable according to the environment. If only one client/server, it should be tunable so that it doesn't use too many resources. And as the load increases, it should add resources. As the load decreases, it should release the resources. We need to answer the question: Which mechanisms are scalable? (can be tuned) Which ones aren't?
- Provisioning
Something that can be done quickly to set up systems. It can quickly install a number of systems. Provides good performance over a short period of time.
- Types of installs
The mechanism we use might be used in all of these cases. They all require good performance.
Interactive install: GUI install; text install
Replication install
- Volume of installs
Provide good performance and reliability over a long period of time. We want to have resources allocated appropriately. For example, the mechanism might give good performance over a short period but slow down as time increases.
- Peak loads
Can the mechanism handle the peak load? Can the mechanism ramp up to keep up with the load?
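The tunable behaviour described in this section (add resources as the load rises, release them as it falls, and stay within limits set by the administrator) can be sketched as a simple sizing function. The parameter names and default values below are assumptions for illustration, not settings of any particular web server.

```python
import math


def target_workers(active_clients, clients_per_worker=10,
                   min_workers=1, max_workers=64):
    """Return how many server workers to run for the current client load.

    All parameters are hypothetical tunables: clients_per_worker is how many
    clients one worker is expected to serve, and min_workers/max_workers are
    the bounds an administrator would set for the environment.
    """
    if clients_per_worker < 1 or min_workers < 0 or max_workers < min_workers:
        raise ValueError("invalid tuning parameters")
    needed = math.ceil(active_clients / clients_per_worker)
    # Clamp to the configured bounds so a single client does not hold many
    # idle workers and a peak load cannot exhaust the server.
    return max(min_workers, min(max_workers, needed))
```

A mechanism that exposes bounds like these counts as tunable in the sense used here; one with a fixed number of workers does not.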
7. Compelling User Experience
- Logging
Some mechanisms already have a logging feature. If the mechanism doesn't have a logging feature, the requirement is that some type of logging will be made available.
- Setup tools
Should be simple and straightforward. If possible, automatic setup. The mechanism must be set up before the AI client talks to the server.
Note: This may not be part of the web server project. This may be a requirement that needs to be relayed to the installadm design.
- Observability tools
Note: Not exactly part of the web server. We need to relay this to the installadm design and tell them where the log files are located. A requirement from the web server design is that the installadm design be able to identify which mechanism we are using.
May also include monitoring log files on the server side and detecting fatal errors.
Status sent from the client to the server. Have the ability to push the client log files to the server.
- Maintenance tools
Tools for tuning the mechanism. Could be a simple set of instructions, or it could be a command.
- Status tools
We should be plugged into SMF. The transport mechanism should be an SMF service. installadm list should list the clients as well as their transport mechanism.
server - Could show the status of what is happening. We could look at parameters and see how the resources are being used.
client - Also want to show the status of what is happening.
- Fault management on both client and server side:
We need to evaluate how to recover from infrastructure failure.
What do you do in the case of data corruption failure?
Incorrect user configuration.
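One possible answer to the infrastructure failure and data corruption cases above is to retry the download and check each result against a known checksum. The sketch below assumes the server publishes a SHA-256 digest for each file, which is not something the current design specifies. The fetch callable stands in for whatever transport is chosen, and holding the whole file in memory is a simplification.

```python
import hashlib
import time


def fetch_verified(fetch, expected_sha256, attempts=3, base_delay=1.0,
                   sleep=time.sleep):
    """Call fetch() until it returns bytes whose SHA-256 matches.

    fetch is any zero-argument callable that returns bytes and raises
    OSError on a transport failure (a hypothetical interface). Retries use
    exponential backoff: base_delay, then 2 * base_delay, and so on.
    """
    last_error = None
    for attempt in range(attempts):
        if attempt:
            sleep(base_delay * 2 ** (attempt - 1))
        try:
            data = fetch()
        except OSError as exc:
            # Infrastructure failure: remember it and try again.
            last_error = exc
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256.lower():
            return data
        # Data corruption: the transfer finished but the content is wrong.
        last_error = ValueError("checksum mismatch")
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Incorrect user configuration is not covered by this approach: a wrong URL or a wrong digest fails on every attempt, so it has to surface through logging rather than retries.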
8. Secure Delivery
- Does the transport offer end-to-end security?
- DHCP -- none, only obscurity
- TFTP -- none
- HTTP -- end to end via SSL
- BitTorrent -- end to end via SSL
- Client side security
- Encryption -- to prevent data tampering
- Authentication -- to ensure server is correct and expected server
- Server side security
- Encryption -- to prevent data leaking
- Authentication -- to ensure only authorized clients receive data
- Non-Repudiation -- to ensure client got all data sent
- What special setup is needed to enable security?
- Key management
- Server side (setting up SSL keys)
- Client side (distributing keys to clients)
- X86: perhaps a CD or USB
- SPARC: WAN Boot has ways of storing key in OBP
- Certificate host?
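For the HTTP case, the client-side and server-side authentication items above map onto TLS (the successor to SSL) roughly as follows. This is a sketch using Python's standard ssl module, not a decision about the implementation language. The file path parameters are hypothetical and depend on how the key management questions above are answered.

```python
import ssl


def client_context(ca_file=None):
    """Client side: verify that the server is the expected server.

    ca_file is a hypothetical path to the CA certificate distributed to
    clients; when omitted, the system's default trust store is used.
    """
    # The default context requires a valid certificate and checks that the
    # certificate matches the server's host name.
    return ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)


def server_context(cert_file=None, key_file=None, client_ca_file=None):
    """Server side: present a certificate and optionally require client certs.

    All three arguments are hypothetical file paths.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if cert_file:
        context.load_cert_chain(cert_file, key_file)
    if client_ca_file:
        # Only clients holding a certificate signed by this CA are served.
        context.load_verify_locations(cafile=client_ca_file)
        context.verify_mode = ssl.CERT_REQUIRED
    return context
```

Requiring client certificates is what forces the client-side key distribution problem listed above; without it, the server authenticates itself to the client but accepts any client.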
Questions we need to answer
- What transport issues could occur when the auto installation process is scaled up to multiple clients?
  - It depends on the number of clients and the amount of server resources available. If the server cannot handle all the requests, there will be performance and reliability issues.
  - It depends on where in the process the failure occurs:
    - Initial boot:
      - X86: Uses DHCP and TFTP. If one step fails to respond in time, the boot fails.
      - SPARC: Uses DHCP and HTTP. If one step fails, the boot is retried, perhaps 10 times?
    - Boot archive downloads:
      - All: If the download fails, wget(1) can retry; BitTorrent could retry too. This would be specified in the SMF method under the auto-install service.
- What factors limit the number of simultaneous client installations, and how can those be scaled?
  - Server resources
  - Limitations of the transport that we cannot change
  - The inability of the transport to adapt to the scaling
- What steps will a user need to follow in order to set up an AI for multiple clients?
  - This depends on the other tools (installadm etc.)
  - If we are sticking to transport, the user needs to tune the configuration files needed by the transport depending on the needs of the environment.
- What type of observability tools could we provide to the user that would allow the user to observe the installation and any failures that occur in the client, the server, or the data transfer?
  - The transport logs
  - Status monitor for all clients
  - Tool(s) to monitor client transmission to catch status and failures
  - Server failure will make the client fail or hang during data transfer
- How will we scale the observability tools to handle simultaneous client installations?
- How will we scale the observability tools if multiple clients are installing from multiple install servers? What about from multiple networks?
- Where is the user located during the installation process? We need to determine this.
- If you have 300 client installations and 35 fail, how do you determine which 35 clients failed?
- What tools do we currently have that might help us develop an observability tool?
  - Logging in /tmp
  - Networking
  - SNMP traps
  - Others?
- There are two phases we need to consider:
  - Before Solaris boots, while the client is in PXE GRUB
  - After Solaris boots
- What failure scenarios should we consider?
  - Mechanism could go down
  - Install server could go down
  - Client could go down
- We need to think about how to make sure that the user can observe what is happening when there is a large number of clients:
  - What points are observable?
  - How will we observe those points?
  - Is there value in observing a particular point?
  - What are the characteristics/attributes a user wants to observe?
  - How will we provide the useful information to the user?
  - What is observable on the client side, the server side, and the interaction between the two?
- What are the logical phases for implementing this design?
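As a starting point for the "300 installations, 35 failures" question above, the sketch below shows how status reports sent from clients to the server could be reduced to a per-client summary. The event format (a client identifier plus one of "started", "completed", or "failed") is an assumption for illustration; no such reporting protocol is defined yet.

```python
def summarize(expected_clients, events):
    """Group clients by their most recent reported status.

    expected_clients is the set of client identifiers we intended to install.
    events is an iterable of (client_id, status) pairs in arrival order.
    Both the identifiers and the status names are hypothetical.
    """
    latest = {}
    for client_id, status in events:
        latest[client_id] = status  # a later report overrides an earlier one

    report = {"completed": [], "failed": [], "in_progress": [], "silent": []}
    for client_id in sorted(expected_clients):
        status = latest.get(client_id)
        if status is None:
            # Never reported: the client may have failed before Solaris booted.
            report["silent"].append(client_id)
        elif status in ("completed", "failed"):
            report[status].append(client_id)
        else:
            report["in_progress"].append(client_id)
    return report
```

The "silent" group matters because of the two phases noted above: a client that fails while still in PXE GRUB cannot send a status report, so it can only be detected by its absence.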