I'm sending the same query to a large (well, okay, n=5) number of identical endpoints (a few kubernetes clusters) and collating the results together. I want the requests to go out in parallel and time out after a while; I want failures to be reported to the user without impeding further progress.
As an example of the desired output:
$ node find_broken_pods.js
tomato: error: timed out
Cluster Pod Reason
potato bar-foos Invalid PVC
potato foo-eggnog Invalid image
yoghurt spam-baz Invalid name
$ node find_broken_pods.js
Cluster Pod Reason
potato bar-foos Invalid PVC
potato foo-eggnog Invalid image
yoghurt spam-baz Invalid name
tomato what-bat Insufficient jQuery
The first time around, we failed to get an answer from the tomato cluster for whatever reason, but we were able to enumerate the resources from the other clusters. The second time around, there was no timeout and we were able to list everything.
I previously had come up with this:
export async function queryAll(): Promise<{cluster: string; deployments: V1Pod[]}[]> {
const out: {cluster: string; result: V1Pod[]}[] = [];
const promises: Promise<number>[] = [];
for (const cluster of Object.values(CLUSTERS)) {
promises.push(
Promise.race([
new Promise<number>((_, reject) => setTimeout(() => reject(new Error(`${cluster}: timed out`)), 5000)),
new Promise<number>((resolve, _) =>
getAllPods(cluster)
.then(pods => resolve(out.push({cluster: cluster, result: pods})))
),
])
);
}
await Promise.all(promises);
return out;
}
This does run everything in parallel, but a single failure results in the entire function crashing and burning. I figured I could change it to this:
export async function queryAll(): Promise<{cluster: string; deployments?: V1Deployment[]; error?: string}[]> {
const out: {cluster: string; result?: V1Pod[]; error?: string}[] = [];
const promises: Promise<number>[] = [];
for (const cluster of Object.values(CLUSTERS)) {
promises.push(
Promise.race([
new Promise<number>((resolve, _) =>
setTimeout(() => {
resolve(out.push({cluster: cluster, error: 'timed out'}));
}, 5000)
),
new Promise<number>((resolve, _) =>
getAllPods(cluster)
.then(pods => resolve(out.push({cluster: cluster, result: pods})))
),
])
);
}
await Promise.all(promises);
return out;
}
However, I'm now witnessing both arms of the promise run to completion, that is the out array contains (some) clusters reporting data and either all or none of my clusters reporting a timeout:
- if no timeout is reached,
Promise.allwon't wait for thesetTimeouts, but the node process will (i.e. if I set the timeout to 60 seconds then the node process will only exit after 60 seconds) - if any timeout is reached,
Promise.allwill wait for the timeout to happen... meaning allsetTimeouts end up triggering.
I instead expected that the losing arm of the Promise.race would be killed somehow or prevented from running.
Something tells me my approach is fundamentally broken... how can I improve my fault tolerance?
CodePudding user response:
Yes, you've got both tasks running and pushing objects to the out array. Nothing is stopping them - Promise.race does not magically undo the steps that were taken to create the promises passed to it.
You can solve this with a flag for each cluster whether the timeout or result has already been reported so that the loser of the race won't report their status.
However, it's much simpler to just use the result value that the promises fulfill with, which Promise.race already does forward:
interface QueryResult {cluster: string; deployments?: V1Deployment[]; error?: string}
export async function queryAll(): Promise<QueryResult[]> {
const promises: Promise<QueryResult>[] = Object.values(CLUSTERS).map(cluster =>
Promise.race([
new Promise(resolve => {
setTimeout(() => {
resolve({cluster: cluster, error: 'timed out'});
}, 5000)
}),
getAllPods(cluster).then(pods => ({cluster: cluster, result: pods}))
])
);
const out = await Promise.all(promises);
return out;
}
With this approach, you even get the results consistently in the same order as the CLUSTERS, not in the order they came back. If you don't want that, you can still take your original approach with push - just do it with the result of each race!
export async function queryAll(): Promise<QueryResult[]> {
const out: QueryResult[] = [];
await Promise.all(Object.values(CLUSTERS).map(cluster =>
Promise.race([
new Promise(resolve => {
setTimeout(() => {
resolve({cluster: cluster, error: 'timed out'});
}, 5000)
}),
getAllPods(cluster).then(pods => ({cluster: cluster, result: pods}))
]).then(result => {
out.push(result);
})
));
return out;
}
Btw, also consider Promise.allSettled instead of Promise.all - you might as well reject from your timeout promise with that.
CodePudding user response:
Looks like this is close enough for my needs, but still not it:
export async function queryAll(): Promise<{cluster: string; deployments: V1Pod[]}[]> {
const out: {cluster: string; result: V1Pod[]}[] = [];
const promises: Promise<number>[] = [];
for (const cluster of Object.values(CLUSTERS)) {
promises.push(
Promise.race([
new Promise<number>((_, reject) => setTimeout(() => reject(new Error(`${cluster}: timed out`)), 5000)),
new Promise<number>((resolve, _) =>
getAllPods(cluster)
.then(pods => resolve(out.push({cluster: cluster, result: pods})),
error => {
process.stderr.write(`${cluster}: ${error}\n`);
resolve(0);
})
),
]).catch(error => {
process.stderr.write(`${cluster}: ${error}\n`);
return 0;
})
);
}
await Promise.all(promises);
return out;
}
This can result in this output as a result of duplicated error handling, which I don't mind so much; for example, if I force a failure with /etc/hosts, I get:
$ node find_broken_pods.js
tomato: Error: timed out
Cluster Pod Reason
potato bar-foos Invalid PVC
potato foo-eggnog Invalid image
yoghurt spam-baz Invalid name
tomato: Error: connect ECONNREFUSED 0.0.0.0:443
What I'm not so happy about is that node specifically waited long enough for whatever library to throw that ECONNREFUSED, which happened a few seconds after my program was done generating its output; part of why I'm doing this in the first place is that there are failure modes in which the process just hangs waiting for a server to respond to a request that got eaten by a firewall.
