partial designdoc or something..

this is insanely large.. feels like building an API company now..
there are no real other users, this is a waste of time, maybe practice..
why am I doing this
spec.md

Gonna use SQLite for queue and output storage. Handling write locks is more difficult, but it has the added benefit of not losing data across container restarts etc, and of absolute references.
One endpoint per blade enclosure seems reasonable. This way it can scale.
A blade list aggregator or the like can be a separate thing (ex: prom plugin). If desired in the future, all endpoints may report back to a central enc / blade listing.
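A minimal sketch of the storage side, assuming Python's stdlib sqlite3 (the file path is a placeholder); WAL mode is one way to soften the write-lock problem mentioned above:

```python
import sqlite3

# one database file per endpoint; survives container restarts
# as long as it sits on a mounted volume (path is a placeholder)
db = sqlite3.connect("db.sqlite")

# WAL lets readers proceed alongside a writer, easing write locks
db.execute("PRAGMA journal_mode=WAL")
```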
# Main
1. init()
1. async
- rest_host() &
- runner() &
- queue_cleanup() &
- ssh_cleanup() &
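A sketch of the process layout above, assuming asyncio; the four '&' items become concurrently running tasks (all bodies are stubs):

```python
import asyncio

async def rest_host(): ...      # HTTP listener, see rest_host() below
async def runner(): ...         # executes queued commands
async def queue_cleanup(): ...  # expires old tickets
async def ssh_cleanup(): ...    # closes idle connections

def init():
    ...  # key/hostname/database checks, see init() below

async def main():
    init()
    await asyncio.gather(rest_host(), runner(), queue_cleanup(), ssh_cleanup())

if __name__ == "__main__":
    asyncio.run(main())
```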
## init()
1. Check for SSH private and public key. Possible credentials? &
1. Check for blade enclosure hostname/endpoint. &
1. Check for a database, and that we can access it with rw. Create if not exists.
1. init_queue() &
1. init_connections() &
### init_queue()
1. Create queue tables if not exist.
1. Run queue_cleanup()
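A sketch of the queue table, assuming SQLite; the columns mirror the fields listed under the create endpoint below:

```python
import sqlite3

db = sqlite3.connect("db.sqlite")  # placeholder path
db.execute("""
    CREATE TABLE IF NOT EXISTS queue (
        ticket           INTEGER PRIMARY KEY,
        ticket_expiry    INTEGER NOT NULL,  -- epoch seconds
        run_blade_bay    INTEGER,           -- exactly one of these
        run_blade_serial TEXT,              -- four may be set
        run_blade_all    BOOLEAN,
        run_enc          BOOLEAN,
        exec_command     TEXT NOT NULL,
        exec_completed   BOOLEAN NOT NULL DEFAULT 0
    )
""")
```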
### init_connections()
1. Create connections table if not exist.
1. Carry over enclosure hostname/endpoint. It is bay 0.
1. Set all lock and connection_active to false.
1. Attempt to exec_blade_list() # POST
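A matching sketch for the connections table (the address and path are placeholders):

```python
import sqlite3

db = sqlite3.connect("db.sqlite")  # placeholder path
db.execute("""
    CREATE TABLE IF NOT EXISTS connections (
        bay               INTEGER PRIMARY KEY,  -- bay 0 is the enclosure
        address           TEXT,                 -- hostname/endpoint
        lock              BOOLEAN NOT NULL DEFAULT 0,
        connection_active BOOLEAN NOT NULL DEFAULT 0,
        last_used         INTEGER               -- epoch seconds
    )
""")
# carry over the enclosure hostname/endpoint as bay 0
db.execute("INSERT OR IGNORE INTO connections (bay, address) VALUES (0, ?)",
           ("enc.example.net",))  # placeholder address
# set all lock and connection_active to false
db.execute("UPDATE connections SET lock = 0, connection_active = 0")
db.commit()
```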
# rest_host()
## /api/v1/exec/create POST
Queues a command for execution. Gives information on how and when to query for the command's status/output.
### Request
- str: who the requester is (the 'command owner'): prom-monitor, remote control provisioner, etc.
- where to run (choose exactly 1):
- int: blade bay nr to run on
- str: blade serial to run on
- bool: run on all blades?
- bool: run on enc?
- str: command to run
#### Server query ticket DBtable for a count of tickets not executed.
### Response
- 503: The queue is too large. Possibly try again later.
- int: current_estimated_queue_runtime in seconds. This is the predicted time by which the endpoint will have dealt with most, if not all, currently queued requests. A Retry-After: header might be used, though that might take away the requester's ability to drop their exec request altogether, ex if it's stats.
### Response
- 422: Craft your commands better.
- Error, on what parameter, what went wrong.
### Response
- 202: Queued
- int: ticket
- randomly generated numbers lead to possible collisions; just count up, there's no desire for auth (and it would be implemented differently as well)
- Re-use ints. It is probably better to count up until a large number, then get the smallest number still in the queue, and restart counting from below it.
- int: Recommended time for when to check back, in seconds from current time.
- (Retry-After: can't be used for this, as that is meant for non-async requests. Though! You could possibly use Retry-After: combined with a 302 redirect to the ticket, but as you can do other things with the ticket, this is not smart.)
- The requester may choose to convert this into an epoch timestamp, or another time format. Using epoch here would add a dependency on clock sync.
- This value is given by the HTTP listener, because:
- 'don't repeat yourself': a central database (probably just written in code) for the command's execution, and the command's dependencies.
- Have a default value.
- Some command dependencies are dynamic as well, and may even change while the named command is being executed. Best case, you get a reply in 3 s; worst case you have to request your response twice, in extreme cases three times.
- Requester should retry getting results up to 5 times, after which they should make a different (ticket withdrawal) request.
- The endpoint also has information on the queue position, and on whether it already has a direct connection to the blade.
- int: ticket expiry, in seconds from current time. This is the time when all information about the ticket is planned to be removed.
- This should be current_estimated_queue_runtime + 200 s (enc iLO reset) + 50 s (iLO reset) + 3 * estimated_command_runtime + 20 s (processing, extra).
- current_estimated_queue_runtime is the minimum. Possible resets account for needed cache refresh times.
- With extreme demand, entropy might seep in and commands may still time out; to counter this, the endpoint has a limit on queue size.
#### Server add to queue DBtable:
- int ticket: 5
- int epoch ticket_expiry: 1320918230
- int run_blade_bay: 5 (only one can be set/true)
- str run_blade_serial: 'XYZ' (only one can be set/true)
- bool run_blade_all: false (only one can be set/true)
- bool run_enc: false (only one can be set/true)
- str exec_command: "command to be execed".
- bool exec_completed: false
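A sketch of the server-side insert, assuming the queue schema above; the expiry follows the formula from the 202 response (the estimated runtimes are placeholder inputs):

```python
import time

def enqueue(db, ticket, command, bay=None, serial=None, all_blades=False,
            enc=False, est_queue_runtime=0, est_command_runtime=10):
    # current_estimated_queue_runtime + 200 s (enc iLO reset) + 50 s (iLO reset)
    # + 3 * estimated_command_runtime + 20 s (processing, extra)
    expiry = (int(time.time()) + est_queue_runtime + 200 + 50
              + 3 * est_command_runtime + 20)
    db.execute(
        "INSERT INTO queue (ticket, ticket_expiry, run_blade_bay,"
        " run_blade_serial, run_blade_all, run_enc, exec_command,"
        " exec_completed) VALUES (?, ?, ?, ?, ?, ?, ?, 0)",
        (ticket, expiry, bay, serial, all_blades, enc, command))
    db.commit()
    return expiry
```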
## /api/v1/exec/withdraw/<ticket> PUT
Not implemented. Future implementation planned.
If the requester isn't interested in the ticket anymore: gave up on waiting, or the output is no longer relevant. They might try again using a different ticket.
This is probably because something has gone wrong on the responder. As such, the requester shouldn't demand or test for success.
### Request
- int: ticket
#### Server query DB for ticket in queue DBtable
- int ticket number (does it exist?)
### Response
- 404: Not found. The ticket might've expired.
### Response
- 200: Ticket successfully withdrawn.
#### Server delete row from queue DBtable where ticket ID is <ticket>.
Response DBtable is not touched by design.
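A sketch of the server side, assuming the queue table above; note the results DBtable is deliberately left alone:

```python
def withdraw(db, ticket):
    # only the queue row goes away; results DBtable is not touched by design
    cur = db.execute("DELETE FROM queue WHERE ticket = ?", (ticket,))
    db.commit()
    return 200 if cur.rowcount else 404  # 404: the ticket may have expired
```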
## /api/v1/exec/result/<ticket> GET
Attempt to get the result of a previously queued command.
#### Server query in DBtable queue for the ticket.
- May fail due to ticket not found.
### Response
- 404: Not found. The ticket might've expired.
### Response
Authentication not implemented.
- 403: Requester is not the owner of the ticket (queued command). As such they may not get any information about it.
### Response
Requester should usually retry up to 5 times while the ticket hasn't expired.
- 409: Too early, check back later. Command result not yet available.
- int: Recommended time for when to check back, in seconds from current time.
- int: new ticket expiry
#### Server update localdb to use new expiry in ticket DBtable.
### Response
- 200: Command executed successfully.
- bool exec_result
- str (might be multiline): exec_output
#### If output_ready: true
- Get the ticket in results DBtable, that includes:
- exec_success: bool
- exec_output: str
### Response
- 424: Failed execution. This might be due to a faulty command provided by the requester beforehand, or a declaration of failure after multiple attempts at accessing unavailable resources. The target might be absent as well (blade removed).
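A sketch of the lookup logic for this endpoint, assuming the tables above and a results DBtable keyed by ticket; the spec names the success flag both exec_result and exec_success, exec_success is used here (auth, and thus 403, is left out):

```python
def get_result(db, ticket):
    row = db.execute("SELECT exec_completed FROM queue WHERE ticket = ?",
                     (ticket,)).fetchone()
    if row is None:
        return 404, None            # not found; the ticket might've expired
    if not row[0]:
        # too early, check back later
        # (the recommended check-back time and expiry bump are omitted here)
        return 409, None
    result = db.execute("SELECT exec_success, exec_output FROM results"
                        " WHERE ticket = ?", (ticket,)).fetchone()
    if result and result[0]:
        return 200, result          # command executed successfully
    return 424, result              # failed execution
```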
## queue_cleanup()
Every n (10 minutes, an hour, whatever), check for old entries.
If an entry's expiry epoch is 10 s or more below the current epoch, remove the entry.
1. queue DBtable; if possible efficiently, check if exec_completed is false; if so, issue a warn
1. results DBtable; if possible efficiently, check if exec_success is false; if so, issue a warn
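A sketch of the sweep over the queue DBtable, assuming the schema above; the results DBtable would get the same treatment:

```python
import time

def queue_cleanup(db):
    # expiry epoch 10 s or more below the current epoch -> remove
    cutoff = int(time.time()) - 10
    rows = db.execute("SELECT ticket, exec_completed FROM queue"
                      " WHERE ticket_expiry <= ?", (cutoff,)).fetchall()
    for ticket, completed in rows:
        if not completed:
            print(f"warn: ticket {ticket} expired without completing")
        db.execute("DELETE FROM queue WHERE ticket = ?", (ticket,))
    db.commit()
```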
## /metrics
- estimated_queue_runtime_seconds
- errors_total
- errors_exec_total
- errors_timeout_total
- withdrawals_total
- ssh_connections_total
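A sketch of exposing these, assuming the prometheus_client library (the port is a placeholder; in practice the existing rest_host() listener could serve /metrics instead):

```python
from prometheus_client import Counter, Gauge, start_http_server

queue_runtime   = Gauge("estimated_queue_runtime_seconds",
                        "Predicted time to drain the queue")
errors          = Counter("errors_total", "All errors")
errors_exec     = Counter("errors_exec_total", "Failed executions")
errors_timeout  = Counter("errors_timeout_total", "Timed-out tickets")
withdrawals     = Counter("withdrawals_total", "Withdrawn tickets")
ssh_connections = Counter("ssh_connections_total", "SSH connections opened")

start_http_server(9100)  # placeholder port
```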
# runner()
Main process that runs the things in the queue.

```
while True:
    query queue DBtable for first item (not lowest ticket number)
    if fail:  # no items in queue
        sleep(5)  # might be a while
    else:
        shell_exec(bay, command)
```
The queue should be independent for all targets ('all blades' just throws a thing into all queues and assembles the results, or even just disallows 'all' until a future implementation). The queues being independent (with commands added to the enc queue locally taking priority) allows async execution of commands. In normal operation though, a single queue is fine, as we are expecting loads of idle time and a maximum of 1-3 items in the queue (and as execution time is 10-15 or 30 s, even for 'all targets' commands, it doesn't queue up). A minimal sketch of the single-queue case follows.
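This sketch assumes the queue schema above and shell_exec() as specced below (sync for brevity):

```python
import time

def runner(db):
    while True:
        # first item by insertion order, not lowest ticket number
        row = db.execute("SELECT ticket, run_blade_bay, exec_command"
                         " FROM queue WHERE exec_completed = 0"
                         " ORDER BY rowid LIMIT 1").fetchone()
        if row is None:
            time.sleep(5)  # no items in queue; might be a while
            continue
        ticket, bay, command = row
        shell_exec(bay, command)  # bay-targeted case only, for brevity
        db.execute("UPDATE queue SET exec_completed = 1 WHERE ticket = ?",
                   (ticket,))
        db.commit()
```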
## shell_exec(bay, command, bay2)
Summons or re-uses an existing connection, and runs the command.
bay2 is optional; only used for 'connect server bay2'.
```
while connections DBtable bay lock == true:
    sleep(5)
connections DBtable: set bay last_used to current epoch
connections DBtable: set bay lock to true
ssh_connection_check(bay)
if tests not ok:
    ssh_cleanup_prune(con)  # close any previous things
    if bay == 0:  # is enc
        ssh_connect(enc, enc_address)
        if ssh_connect[0] == false:  # connection failed
            pass  # TODO: unhandled
    else:
        ssh_connect_blade(bay)  # connect directly to blade; needs rework, else we repeat ourselves
        # TODO: needs splitting off, if doing inception connections with blades
connections DBtable: set bay lock to false
```
### ssh_connect_blade(bay)
Attempts to connect directly to a blade.
```
ssh_connect(blade, address)
if failed:  # creds expired, or unavailable
    warn
    # TODO: connect to enc, but then connect to bay, then run command,
    # return the result, but attempt to setup direct ssh thereafter with
    # the same connection
# TODO:
```
### ssh_connect(type, address)
Opens an ssh connection to the requested thing.

```
*somehow connect*
ssh_prompt(type, reply)
return true/false
```
### ssh_prompt(type, reply)
Check for the right kind of prompt.
types: enc, blade
#TODO:
### ssh_connection_check(bay)
1. Connections DBtable: get bay connection_active.
1. Check that the named SSH connection actually exists.
1. If bay == 0 (is enc): test for an active shell with date, or a similar command. Else: test for an active shell with a command that is only available on blades.
1. Return true if the tests succeed.
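A sketch of the liveness test, assuming paramiko and some clients mapping from bay to an open SSHClient (the blade-only probe command is a placeholder):

```python
import paramiko

def ssh_connection_check(bay, db, clients):
    active = db.execute("SELECT connection_active FROM connections"
                        " WHERE bay = ?", (bay,)).fetchone()
    client = clients.get(bay)
    if not active or not active[0] or client is None:
        return False
    # bay 0 is the enc; blades get a command only available on blades
    probe = "date" if bay == 0 else "blade-only-command"  # placeholder
    try:
        _, stdout, _ = client.exec_command(probe, timeout=5)
        return stdout.channel.recv_exit_status() == 0
    except (paramiko.SSHException, EOFError, OSError):
        return False
```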
## Medium-High level SSH commands
### exec_blade_list()
#TODO:
# ssh_cleanup()
Closes idle connections.
Every n (5 minutes or whatever), query the connections DBtable by last_used:

```
for con in result:
    if connections DBtable con lock == true:
        if connections DBtable con last_used >= 5 minutes ago:
            ssh_cleanup_prune(con)
    else:
        ssh_cleanup_prune(con)
```
## ssh_cleanup_prune(con)
```
ssh_run(con, 'exit')  # don't wait for this
ssh_close(con)
connections DBtable: set connection_active to false
```