From 0fc3c4069baf4d09f7a3d56d1c2b901693f19f4a Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Thu, 22 Jun 2023 22:16:49 +0000 Subject: [PATCH 01/11] rfc for reducntion ops --- rfcs/20230622-quantized-reduction.md | 521 +++++++++++++++++++++++++++ 1 file changed, 521 insertions(+) create mode 100644 rfcs/20230622-quantized-reduction.md diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md new file mode 100644 index 00000000000..e38db43203f --- /dev/null +++ b/rfcs/20230622-quantized-reduction.md @@ -0,0 +1,521 @@ +# RFC: StableHLO quantization for reduction ops + +Status: Review
+Initial version: 06/22/2023
+Last updated:
+Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) + +## Version log + +* 06/22/2023: Initial version. + +## Introduction + +The [reduce](https://github.com/openxla/stablehlo/blob/main/docs/spec.md#reduce) +op, for non-quantized types, has constraints like + +```python +(C2) element_type(inputs...) = element_type(init_values...) = element_type(results...). +(C6) body has type tensor, ..., tensor, tensor, ..., tensor) -> (tensor, ..., tensor) where Ei = element_type(inputs[i]). +``` + +which constrained the signature of reduce op and its associated reducer function +`body` to have the same element types for `inputs`, `results` and arguments and +return for `body`. For reducer function performing an accumulative operation like +add, this means that the the result of accumulation can overflow in which case +the result will be implementation defined (e.g., +[saturated](https://en.wikipedia.org/wiki/Saturation_arithmetic) or +[wrap around](https://en.wikipedia.org/wiki/Integer_overflow)). +From the conversation with customers it seems a reasonable behavior for non +quantized data types. However, with quantized data types, such loss in precision +is not acceptable and hence the motivation is to perform the accumulation in +some higher data type. + +The RFC highlights some of the options emerged out of discussion in the +[thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) +along with their tradeoffs. The proposal option #1 looks promising at this +point, but we are open to further discussion on this. + +## Option 1: Introduce additional conversion functions + +[The thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) +discuses an option, proposed by @loganchien, on how to achieve the structural +changes as mentioned above. We note that some of the examples/diagrams presented +here are borrowed from an internal doc @loganchien authored. + +The proposed options introduces on-the-fly type conversions, which (1) convert +the input type to the type of the `body` function argument and (2) convert the +result type of the `body` function to the output type. Following is the code +snippet with the proposed syntax of reduce op: + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^input_conversion( + %input: tensor>): + %input_rescaled = "stablehlo.uniform_quantize"(%input) + : (tensor>) + -> tensor> + "stablehlo.return"(%input_rescaled) + : (tensor>) -> () + + }, { + ^reduce_computation( + %lhs: tensor>, + %rhs: tensor>): + %add = "stablehlo.add"(%lhs, %rhs) + : (tensor>, + tensor>) + -> tensor> + "stablehlo.return"(%add) + : (tensor>) -> () + }, { + ^output_conversion( + %intermediate_result: tensor>): + %output_rescaled = "stablehlo.uniform_quantize"(%intermediate_result) + : (tensor>) + -> tensor> + "stablehlo.return"(%output_rescaled) + : (tensor>) -> () + }) { + dimensions = dense<...> : tensor<1xi64> + } : (tensor<... x !quant.uniform>, + tensor<... x !quant.uniform>) + -> tensor<... x !quant.uniform> +``` + +### Semantics + +Here we will informally propose the semantics of the additional functions +`input_conversion` and `output_conversion` introduced. 
+ +```python ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ +|init_value| |input[0]| |input[1]| |init_value| |input[2]| |input[3]| ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ + | | | | | | ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ +|input | |input | |input | |input | |input | |input | +|convert | |convert | |convert | |convert | |convert | |convert | ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ + \ / / \ / / + +-------+ / +-------+ / + |compute| / |compute| / + +-------+ / +-------+ / + \ / \ / + +-------+ +-------+ + |compute| |compute| + +-------+ +-------+ + \___________ ___________/ + \ / + +-------+ + |compute| + +-------+ + | + +-------+ + |output | + |convert| + +-------+ +``` + +### Semantics of `input_conversion` block + +The `input_conversion` block is applied selectively to the leaf nodes of a +schedule tree as shown in above diagram. Note that the `input_conversion` cannot +be applied to the non-leaf nodes of the schedule tree. + +### Semantics of `output_conversion` block + +The `output_conversion` block is applied just after the `result` for a particular +index is computed as shown in the above diagram. + +Please refer to the [formal spec](#specification-of-reduce-op) of the proposed +reduce op. + +### Implementation details + +From the implementation POV of the proposed spec, we note that +`input_conversion` and `output_conversion` can very well be optional with +default values as identity functions. For example, the following code snippet + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^reduce_computation( + %lhs: tensor>, + %rhs: tensor>): + %add = "stablehlo.add"(%lhs, %rhs) + : (tensor>, + tensor>) + -> tensor> + "stablehlo.return"(%add) + : (tensor>) -> () + }) { + dimensions = dense<...> : tensor<1xi64> + } : (tensor<... x !quant.uniform>, + tensor<... x !quant.uniform>) + -> tensor<... x !quant.uniform> +``` + +should be interpreted as + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^input_conversion( + %input: tensor>): + "stablehlo.return"(%input) + : (tensor>) -> () + + }, { + ^reduce_computation( + %lhs: tensor>, + %rhs: tensor>): + %add = "stablehlo.add"(%lhs, %rhs) + : (tensor>, + tensor>) + -> tensor> + "stablehlo.return"(%add) + : (tensor>) -> () + }, { + ^output_conversion( + %intermediate_result: tensor>): + "stablehlo.return"(%intermediate_result) + : (tensor>) -> () + }) { + dimensions = dense<...> : tensor<1xi64> + } : (tensor<... x !quant.uniform>, + tensor<... x !quant.uniform>) + -> tensor<... x !quant.uniform> +``` + +Note that with default values, the input/result type of `reduce` op matches +with the argument or the result type of the `reduce_computation`, including the +quantization parameters. + +It is important to enforce the following aspect in the implementations: both +conversion functions must be provided or both must be missing. If this is not +followed, and there are two functions provided in the IR, it would be difficult +to determine which function is intended for which purpose. + +The existing pretty printing is currently producing the following output +`stablehlo.reduce(%input init: %init_value) applies stablehlo.add across +dimensions = [1] : (tensor<1x6xi64>, tensor) -> tensor<1xi64>`. IMO, +modifying the above format, with the default conversion function, will create +clutter. My proposal here is to follow the existing pretty printing when the +conversion functions are "not provided". 
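As a rough illustration (not authoritative printer output), such a reduce could then keep printing in the familiar compact form; the quantization parameters and dimensions below are placeholders chosen for this sketch:

```mlir
// Hypothetical compact form when the conversion functions are omitted
// (defaulting to identity). The quantized element type is illustrative only.
%result = stablehlo.reduce(%input init: %init_value) applies stablehlo.add
    across dimensions = [1]
    : (tensor<1x6x!quant.uniform<i8:f32, 0.1>>,
       tensor<!quant.uniform<i8:f32, 0.1>>)
    -> tensor<1x!quant.uniform<i8:f32, 0.1>>
```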
In the event, the conversion functions +are explicitly provided, then the pretty printers will fall back to default +generic printing, +**even if the explicitly provided conversion functions are identity function**: +To avoid identification of identity functions which could be tricky in general. + +#### Tradeoffs + +* (+) Enables programmers to program at (almost) baremetal. If the hardware + can support reduction computation in wider type (e.g. in the SIMD + instruction set, we typically do widening/compute/narrowing within the + kernel to save the memory bandwidth), the programmer can explicitly request + for that. +* (-) The disadvantage of this representation is that the syntax is more + verbose and requires significant changes to the specification. + +## Option 2: re-scale input to accumulation type + +This option is the simplest from the POV for specification of quantized `reduce` +op. This is adding `stablehlo.uniform_quantize` and `stablehlo.dequantize` ops +respectively before and after reduce op which operates on the "accumulator" +type. + +```mlir +%widen = "stablehlo.uniform_quantize"(%input) + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> + +%reduce = "stablehlo.reduce"(%widen) { + ^reduce_computation(%lhs: !quant.uniform, %rhs: !qunat.uniform): + // reduce_computation_block + } + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> + +%narrowed = "stablehlo.uniform_dequantize"(%reduce) + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> +``` + +### Tradeoffs + +* (+) An advantage of this option is that we only need minor changes to the + specification (i.e. to allow quantized types). +* (-) The compiler must pattern match 3 operations and map them into some + internal representation before their compilation or execution. +* (-) The compiler must ensure that the `stablehlo.uniform_quantize` (or + `stablehlo.convert` in the case of `bf16` or `f16`) is not folded before the + backend matches the pattern. + [for more information](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) + +## Other options considered + +There is another option considered which did not fly well because of limited +extensibility. Adding it just for completeness purposes. + +### Option 3: allow accumulator type to be different from input type + +The idea here is to convey the accumulator type using the `init_value` operand +of `reduce` op. The code snippet for `reduce` looks like: + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^reduce_computation( + %elem: tensor>, + %acc: tensor>): + %elem_rescaled = "stablehlo.uniform_quantize"(%elem) + : (tensor>) + -> tensor> + %add = "stablehlo.add"(%elem_rescaled, %acc) + : (tensor>, + tensor>) + -> tensor> + "stablehlo.return"(%0) + : (tensor>) -> () + }) { + dimensions = dense<1> : tensor<1xi64> + } : (tensor<... x !quant.uniform>, + tensor<... x !quant.uniform>) + -> tensor<... x !quant.uniform> +``` + +In this option, the `init_value` type and the `result` type can be different +from the input type. The first argument of the compute block is fixed for the +traversed element and the second argument is fixed for the intermediate +(accumulation) result. + +#### Tradeoffs + +* (+) Make the accumulation type explicit in the IR. +* (-) This representation imposes a limitation on the evaluation order. 
+ Since we can’t express the computation between two intermediate (accumulation) + results, we can not arbitrarily insert `init_value` and start the + computation at an arbitrary location. The following shows the restricted + evaluation order with the method. + +```python ++----------+ +--------+ +--------+ +--------+ +--------+ +|init_value| |input[0]| |input[1]| |input[2]| |input[3]| ++----------+ +--------+ +--------+ +--------+ +--------+ + \ / / / / + +-------+ / / / + |compute| / / / + +-------+ / / / + \ / / / + +-------+ / / + |compute| / / + +-------+ / / + \ / / + +-------+ / + |compute| / + +-------+ / + \ / + +-------+ + |compute| + +-------+ +``` + +## Open Question + +### Should we restrict the proposal #1 to quantized types only? + +The above proposal #1 of introducing the additional functions is theoretically +not limited to quantized `reduce` op, but also can be applied to `reduce` op with +non-quantized types. For example, + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^input_conversion(%arg0: tensor): + %0 = "stablehlo.convert"(%arg0): (tensor) -> (tensor) + "stablehlo.return"(%0) : (tensor) -> (tensor) + }, { + ^bb0(%arg0: tensor, %arg1: tensor): + %0 = "stablehlo.add"(%arg0, %arg1) : (tensor, tensor) -> + tensor + "stablehlo.return"(%0) : (tensor) -> () + }, + ^output_conversion(%arg0: tensor): + %0 = "stablehlo.convert"(%arg0): (tensor) -> (tensor) + "stablehlo.return"(%0) : (tensor) -> (tensor) + }) { + dimensions = dense<1> : tensor<1xbf16> +} : (tensor<1x6xbf16>, tensor) -> tensor<1xbf16> +``` + +However, it is not clear how such operations will be lowered to other IR +representations, like HLO, which does not support such additional computation +blocks. IMO there is no additional benefit to support such conversion +functions for regular type given that there already exists infrastructure +(backend support, lowering passes) to support regular types w/o conversion +functions. My proposal here would be to restrict the support to only quantized +types. + +## Appendix + +To provide an estimate of specification changes needed to implement option #1 +I have attempted to provide the blueprint here. + +### Specification of reduce op + +```python +#### Semantics + +Applies a reduction functions `input_conversion`, `body`, and +`output_conversion` to `inputs` and `init_values` along the `dimensions` and +produces `results` tensors. + +The order of reductions is implementation-defined, which means that `body` and +`init_values` must form a monoid to guarantee that the operation produces the +same results for all inputs on all implementations. However, this condition +doesn't hold for many popular reductions. E.g. floating-point addition for +`body` and zero for `init_values` don't actually form a monoid because +floating-point addition is not associative. + +More formally, `results...[j0, ..., jR-1] = +map(output_conversion, reduce(input_slices_converted))` where: + +* `input_slices = inputs...[j0, ..., :, ..., jR-1]`, where `:` are inserted + at `dimensions`. +* `input_slices_converted = map(input_conversion, input_slices...)`. +* `reduce(input_slices_converted) = exec(schedule)` for some binary tree + `schedule` where: + * `exec(node) = body(exec(node.left), exec(node.right))`. + * `exec(leaf) = leaf.value`. 
+* `schedule` is an implementation-defined full binary tree whose in-order + traversal consists of: + * `input_slices_converted...[index]` values, for all `index` in + `index_space(input_slices_converted)` in the ascending lexicographic order + of `index`. + * Interspersed with an implementation-defined amount of `init_values` + at implementation-defined positions. + +#### Inputs + +| Label | Name | Type | Constraints | +|-------|---------------------|----------------------------------------------|-------------| +| (I?) | `inputs` | variadic number of tensors | | +| (I?) | `init_values` | variadic number of 0-dimensional tensors | | +| (I?) | `dimensions` | 1-dimensional tensor constant of type `si64` | | +| (I?) | `input_conversion` | function | | +| (I?) | `body` | function | | +| (I?) | `output_conversion` | function | | + +#### Outputs + +| Name | Type | Constraints | +|-----------|----------------------------|-------------| +| `results` | variadic number of tensors | | + +#### Constraints + +* (C?) `same(shape(inputs...))`. +* (C?) `element_type(inputs...) = element_type(init_values...)`. +* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. +* (C?) `0 < size(inputs) = size(init_values) = size(results) = N`. +* (C?) `0 <= dimensions < rank(inputs[0])`. +* (C?) `is_unique(dimensions)`. +* (C?) `input_conversion` has type `tensor, ..., tensor -> + (tensor, ..., tensor)` where `Ei = element_type(inputs[i])`. +* (C?) `body` has type `tensor, ..., tensor, tensor, ...,` + `tensor) -> (tensor, ..., tensor)` where + `Ei = element_type(output_types(input_conversion)[i])`. +* (C?) `output_conversion` has type `tensor, ..., tensor -> + (tensor, ..., tensor)` where + `E'i = element_type(results[i])`. +* (C?) `element_type(output_types(input_conversion)...) = + element_type(input_types(output_conversion)...)`. +* (C?) `shape(results...) = shape(inputs...)` except that the dimension + sizes of `inputs...` corresponding to `dimensions` are not included. +``` + +The above specification of `reduce` op can be used to define the specification +of other ops as shown below. For brevity, we are only presenting the relevant +portions of the spec which needs modification. + +### reduce_window + +```python +#### Semantics + +Applies a reduction functions `input_conversion`, `body`, and +`output_conversion` to windows of `inputs` and `init_values` and produces +`results`. + +... + +More formally, +`results...[result_index] = reduce(windows, init_values, axes(inputs...), + input_conversion, body, output_conversion)` +where: +.... + +#### Inputs + +| Label | Name | Type | +|-------|---------------------|----------| +| (I?) | `input_conversion` | function | +| (I8) | `body` | function | +| (I?) | `output_conversion` | function | + +#### Constraints + +* (C?) `element_type(inputs...) = element_type(init_values...)`. +* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. +* (C?) `input_conversion` has type `tensor, ..., tensor -> + (tensor, ..., tensor)` where `Ei = element_type(inputs[i])`. +* (C?) `body` has type `tensor, ..., tensor, tensor, ...,` + `tensor) -> (tensor, ..., tensor)` where + `Ei = element_type(output_types(input_conversion)[i])`. +* (C?) `output_conversion` has type `tensor, ..., tensor -> + (tensor, ..., tensor)` where + `E'i = element_type(results[i])`. +* (C?) `element_type(output_types(input_conversion)...) = + element_type(input_types(output_conversion)...)`. 
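# --- Informal illustration, not part of the normative text above ---
# A minimal sketch of how the proposed conversion functions participate in the
# reduction that `reduce` and `reduce_window` are specified in terms of. This
# pseudo-code fixes one particular left-to-right schedule and converts the
# init value at a leaf, mirroring the schedule-tree diagram earlier in this
# RFC; the spec itself leaves the schedule implementation-defined.
def illustrative_reduce(slice_values, init_value,
                        input_conversion, body, output_conversion):
    acc = input_conversion(init_value)
    for value in slice_values:
        acc = body(acc, input_conversion(value))
    return output_conversion(acc)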
+``` + +### select_and_scatter + +This op originally takes two function arguments `select` and `scatter`. As the +`select` function is supposed to perform a non-accumulative operation, we may +not need additional conversion functions associated with `select`. But the +`scatter` function needs be accompanied with `input_conversion` and +`output_conversion` functions. + +```python +#### Semantics + +Scatters the values from the `source` tensor using `scatter` based on the +outcome of `reduce_window` of the `input` tensor using `select` and produces +a `result` tensor. + +More formally: +... +* `result[result_index] = reduce([source_values], [init_value], [0], + input_conversion, scatter, output_conversion)` + where: + ... + +#### Inputs + +| Label | Name | Type | +|-------|---------------------|----------| +| (I8) | `input_conversion` | function | +| (I8) | `scatter` | function | +| (I8) | `output_conversion` | function | + +#### Constraints + + +* (C1) `element_type(operand) = element_type(source)`. +* (C3) `element_type(init_value) = element_type(operand)`. +* (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. +* (C?) `input_conversion` has type `tensor -> (tensor)` where + `Ei = element_type(operand)`. +* (C10) `scatter` has type `(tensor, tensor) -> tensor` where + `E = element_type(output_types(input_conversion))`. +* (C?) `output_conversion` has type `tensor -> (tensor)` where + `E'i = element_type(result)`. +* (C?) `element_type(output_types(input_conversion)) = + element_type(input_types(output_conversion))`. +* (C11) `shape(operand) = shape(result)`. + From b6839c9cac716d720204c86a2343bf67a4f37fa3 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Mon, 3 Jul 2023 01:07:49 +0000 Subject: [PATCH 02/11] any of on the conversion fuction can be optional + typo fixes --- rfcs/20230622-quantized-reduction.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index e38db43203f..c5b1be8c4d6 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -2,7 +2,7 @@ Status: Review
Initial version: 06/22/2023
-Last updated:
+Last updated: 07/02/2023
Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) ## Version log @@ -134,7 +134,7 @@ reduce op. ### Implementation details From the implementation POV of the proposed spec, we note that -`input_conversion` and `output_conversion` can very well be optional with +`input_conversion` or `output_conversion` can very well be optional with default values as identity functions. For example, the following code snippet ```mlir @@ -190,10 +190,10 @@ Note that with default values, the input/result type of `reduce` op matches with the argument or the result type of the `reduce_computation`, including the quantization parameters. -It is important to enforce the following aspect in the implementations: both -conversion functions must be provided or both must be missing. If this is not -followed, and there are two functions provided in the IR, it would be difficult -to determine which function is intended for which purpose. +Also, note that the relative order of `input_conversion` or `output_conversion` +w.r.t the `reduce_computation` can be used to identify the appropriate +conversion function when any one of `input_conversion` or `output_conversion` is +missing. The existing pretty printing is currently producing the following output `stablehlo.reduce(%input init: %init_value) applies stablehlo.add across @@ -333,7 +333,7 @@ non-quantized types. For example, %0 = "stablehlo.add"(%arg0, %arg1) : (tensor, tensor) -> tensor "stablehlo.return"(%0) : (tensor) -> () - }, + }, { ^output_conversion(%arg0: tensor): %0 = "stablehlo.convert"(%arg0): (tensor) -> (tensor) "stablehlo.return"(%0) : (tensor) -> (tensor) @@ -519,3 +519,4 @@ More formally: element_type(input_types(output_conversion))`. * (C11) `shape(operand) = shape(result)`. +``` From e38451e9149a855bb81d48efbfb7119f1e322149 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Thu, 13 Jul 2023 22:04:27 +0000 Subject: [PATCH 03/11] Address feedback: fix a few type checks in the code snippets provodfed --- rfcs/20230622-quantized-reduction.md | 32 ++++++++++++++-------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index c5b1be8c4d6..812a4503fb0 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -162,21 +162,21 @@ should be interpreted as ^input_conversion( %input: tensor>): "stablehlo.return"(%input) - : (tensor>) -> () + : (tensor>) -> () }, { ^reduce_computation( - %lhs: tensor>, - %rhs: tensor>): + %lhs: tensor>, + %rhs: tensor>): %add = "stablehlo.add"(%lhs, %rhs) - : (tensor>, - tensor>) - -> tensor> + : (tensor>, + tensor>) + -> tensor> "stablehlo.return"(%add) - : (tensor>) -> () + : (tensor>) -> () }, { ^output_conversion( - %intermediate_result: tensor>): + %intermediate_result: tensor>): "stablehlo.return"(%intermediate_result) : (tensor>) -> () }) { @@ -262,21 +262,21 @@ of `reduce` op. The code snippet for `reduce` looks like: %result = "stablehlo.reduce"(%input, %init_value) ({ ^reduce_computation( %elem: tensor>, - %acc: tensor>): + %acc: tensor>): %elem_rescaled = "stablehlo.uniform_quantize"(%elem) : (tensor>) - -> tensor> + -> tensor> %add = "stablehlo.add"(%elem_rescaled, %acc) - : (tensor>, - tensor>) - -> tensor> + : (tensor>, + tensor>) + -> tensor> "stablehlo.return"(%0) - : (tensor>) -> () + : (tensor>) -> () }) { dimensions = dense<1> : tensor<1xi64> } : (tensor<... x !quant.uniform>, - tensor<... x !quant.uniform>) - -> tensor<... x !quant.uniform> + tensor<... 
x !quant.uniform>) + -> tensor<... x !quant.uniform> ``` In this option, the `init_value` type and the `result` type can be different From 0eec267bc50916e6c1c2d13cbc90e380c660ad86 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Thu, 13 Jul 2023 22:36:05 +0000 Subject: [PATCH 04/11] Address feedback: fix header indentation and code block language --- rfcs/20230622-quantized-reduction.md | 31 +++++++++++----------------- 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 812a4503fb0..24249697778 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -2,12 +2,13 @@ Status: Review
Initial version: 06/22/2023
-Last updated: 07/02/2023
+Last updated: 07/13/2023
Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) ## Version log * 06/22/2023: Initial version. +* 07/13/2023: Fixed typo in code blocks, header indentation. ## Introduction @@ -128,7 +129,7 @@ be applied to the non-leaf nodes of the schedule tree. The `output_conversion` block is applied just after the `result` for a particular index is computed as shown in the above diagram. -Please refer to the [formal spec](#specification-of-reduce-op) of the proposed +Please refer to the [formal spec](#revised-specification-of-reduce-op) of the proposed reduce op. ### Implementation details @@ -206,7 +207,7 @@ generic printing, **even if the explicitly provided conversion functions are identity function**: To avoid identification of identity functions which could be tricky in general. -#### Tradeoffs +### Tradeoffs * (+) Enables programmers to program at (almost) baremetal. If the hardware can support reduction computation in wider type (e.g. in the SIMD @@ -248,13 +249,10 @@ type. backend matches the pattern. [for more information](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) -## Other options considered - -There is another option considered which did not fly well because of limited -extensibility. Adding it just for completeness purposes. - -### Option 3: allow accumulator type to be different from input type +## Option 3: allow accumulator type to be different from input type +This is another option we considered which does not fly well because of limited +expressibility. Adding it just for completeness purposes. The idea here is to convey the accumulator type using the `init_value` operand of `reduce` op. The code snippet for `reduce` looks like: @@ -284,7 +282,7 @@ from the input type. The first argument of the compute block is fixed for the traversed element and the second argument is fixed for the intermediate (accumulation) result. -#### Tradeoffs +### Tradeoffs * (+) Make the accumulation type explicit in the IR. * (-) This representation imposes a limitation on the evaluation order. @@ -355,9 +353,8 @@ types. To provide an estimate of specification changes needed to implement option #1 I have attempted to provide the blueprint here. -### Specification of reduce op +### Revised specification of reduce op -```python #### Semantics Applies a reduction functions `input_conversion`, `body`, and @@ -426,15 +423,13 @@ map(output_conversion, reduce(input_slices_converted))` where: element_type(input_types(output_conversion)...)`. * (C?) `shape(results...) = shape(inputs...)` except that the dimension sizes of `inputs...` corresponding to `dimensions` are not included. -``` The above specification of `reduce` op can be used to define the specification of other ops as shown below. For brevity, we are only presenting the relevant portions of the spec which needs modification. -### reduce_window +### Revised specification of reduce_window op -```python #### Semantics Applies a reduction functions `input_conversion`, `body`, and @@ -471,9 +466,8 @@ where: `E'i = element_type(results[i])`. * (C?) `element_type(output_types(input_conversion)...) = element_type(input_types(output_conversion)...)`. -``` -### select_and_scatter +### Revised specification of select_and_scatter op This op originally takes two function arguments `select` and `scatter`. As the `select` function is supposed to perform a non-accumulative operation, we may @@ -481,7 +475,6 @@ not need additional conversion functions associated with `select`. 
But the `scatter` function needs be accompanied with `input_conversion` and `output_conversion` functions. -```python #### Semantics Scatters the values from the `source` tensor using `scatter` based on the @@ -490,6 +483,7 @@ a `result` tensor. More formally: ... + * `result[result_index] = reduce([source_values], [init_value], [0], input_conversion, scatter, output_conversion)` where: @@ -519,4 +513,3 @@ More formally: element_type(input_types(output_conversion))`. * (C11) `shape(operand) = shape(result)`. -``` From 29ec7003c942b5232fe5bc55e3f23ac260a3dbd9 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Tue, 25 Jul 2023 18:26:21 +0000 Subject: [PATCH 05/11] fix a few typos --- rfcs/20230622-quantized-reduction.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 24249697778..1251a59f4d6 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -220,9 +220,8 @@ To avoid identification of identity functions which could be tricky in general. ## Option 2: re-scale input to accumulation type This option is the simplest from the POV for specification of quantized `reduce` -op. This is adding `stablehlo.uniform_quantize` and `stablehlo.dequantize` ops -respectively before and after reduce op which operates on the "accumulator" -type. +op. This is adding `stablehlo.uniform_quantize`ops before and after reduce op +which operates on the "accumulator" type. ```mlir %widen = "stablehlo.uniform_quantize"(%input) @@ -234,7 +233,7 @@ type. } : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> -%narrowed = "stablehlo.uniform_dequantize"(%reduce) +%narrowed = "stablehlo.uniform_quantize"(%reduce) : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> ``` From e106ae8c6267f1965a5e71035145c0f61c6bb257 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Fri, 11 Aug 2023 22:07:38 +0000 Subject: [PATCH 06/11] revised plan --- rfcs/20230622-quantized-reduction.md | 93 +++++++++++++++++++++++++++- 1 file changed, 92 insertions(+), 1 deletion(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 1251a59f4d6..2152a5cc9ea 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -2,7 +2,9 @@ Status: Review
Initial version: 06/22/2023
-Last updated: 07/13/2023
+updated: 07/13/2023
: Minor refactoring of the examples. +Last updated: 08/11/2023
: Revision of the proposal to introduce an +attribute to capture accumulation type.
Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) ## Version log @@ -512,3 +514,92 @@ More formally: element_type(input_types(output_conversion))`. * (C11) `shape(operand) = shape(result)`. + +## [11 Aug'23] Revised proposal + +### Context + +Option #2 should be avoided because it is hard to control the transformation +which might disrupt the pattern to be matched. The option #1 sounds good except +that the extra input/output conversion blocks are surplus information. The +specification would benefit if the intent of the conversion blocks can be +expressed precisely. The conversion blocks provides a way to capture the +accumulation type needed to compute the accumulative operation on. + +The revised proposal is: + +* To capture the accumulation type via an additional StableHLO attribute like + `accumulation_element_type`. +* The attribute seems beneficial for other ops as well like `dot_general` and + `convolution`. +* `precision_config`, currently used for `dot_general` and `convolution`, is + used to override the precision specified by the input parameters, allowing the + choice of low precision vs high precision computation. We should consider + adding `precision_config` to all reduction based op as well. + +### Few implementation details + +#### On StableHLO side + +The reduce syntax to be augmented with a optional [type +attribute](https://github.com/llvm/llvm-project/blob/51a57074bc63842970c4c160b05c1a7e42db7523/mlir/include/mlir/IR/OpBase.td#L1466) +as follows: + +```mlir +%0 = stablehlo.reduce(%arg0 init: %arg1) across dimensions = [0] { + accumulation_type = tensor> +} : (tensor<16x!quant.uniform>, tensor>) -> tensor> + reducer(%arg2: tensor>, %arg3: tensor>) { + %1 = stablehlo.add %arg2, %arg3 : tensor> + stablehlo.return %1 : tensor> + } + +// using tablegen specification like +// OptionalAttr>:$accumulation_type +``` + +Note that the main difference between this option and the option #1 is that the +input and output conversion blocks are no longer used as their intent is +specified via the `accumulation_type` attribute. However, the reducer block +still needs to express the computation in accumulation type only. + +**Why optional attribute?** + +* At times, it might be desirable not to hard-code the accumulation type. For + example, when we would like to write a generic code and let the downstream + compilation tools to decide the exact accumulation type based on the hardware + of choice. +* It allows the stablehlo, used in various existing pipelines, to remain + largely unaffected by this change. + +Next, the StableHLO specification should be updated with the syntax and +semantics aspects of this attribute. + +#### On StableHLO Consumers side + +The consumers can pattern match the op taking the accumulation type in account +if the targeted hardware supports accumulation at higher type. +There are still to explore things about maintaining StableHLO-HLO parity which +needs to be addresses as well. + +### Action Plan + +I propose to follow the action plan (order matters): + +* Update the specification of ReduceOp, ReduceWindowOp, and SelectAndScatterOp + op, taking the accumulation type into account, via [open + pr](https://github.com/openxla/stablehlo/pull/1538). +* Finalize the quantized specification of AllReduceOp, BatchNormTrainingOp, + BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp, + via [open ticket](https://github.com/openxla/stablehlo/issues/1666). +* Add implementation for additional attribute in the above ops. 
This includes
+updating the TableGen spec, verifiers, and type inference. [Need a ticket for
+this].
+* Address the disparity between StableHLO and HLO introduced by this new
+attribute in StableHLO: should XLA consume the additional attribute, and if
+so, how? [Need a ticket for this].
+* Spec the behavior of `precision_config` in DotGeneralOp.
+[open issue](https://github.com/openxla/stablehlo/issues/755)
+* Consider adding `precision_config` in reduction op. [Need a ticket for this
+item].
+* Consider adding `accumulation_type` to `dot_general`/`convolution` op.
+[Need a ticket for this item].

From cc036473dddd640d1f2ad47d3589cfb46315ec66 Mon Sep 17 00:00:00 2001
From: Sandeep Dasgupta
Date: Mon, 28 Aug 2023 02:08:03 +0000
Subject: [PATCH 07/11] Iteration 2: revised proposal: drop the accumulation
 type attribute for reduce op
---
 rfcs/20230622-quantized-reduction.md | 723 +++++++++------------------
 1 file changed, 245 insertions(+), 478 deletions(-)

diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md
index 2152a5cc9ea..248d97c097a 100644
--- a/rfcs/20230622-quantized-reduction.md
+++ b/rfcs/20230622-quantized-reduction.md
@@ -2,8 +2,8 @@
 Status: Review
Initial version: 06/22/2023
-updated: 07/13/2023
: Minor refactoring of the examples. -Last updated: 08/11/2023
: Revision of the proposal to introduce an +updated: 07/13/2023: Minor refactoring of the examples.
+Last updated: 08/11/2023: Revision of the proposal to introduce an attribute to capture accumulation type.
Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) @@ -11,6 +11,9 @@ Discussion thread: [GitHub](https://github.com/openxla/stablehlo/pull/1664) * 06/22/2023: Initial version. * 07/13/2023: Fixed typo in code blocks, header indentation. +* 08/11/2023: Revision of the proposal to introduce an attribute to capture + accumulation type. +* 08/25/2023: The additional attribute is redundant. ## Introduction @@ -24,44 +27,32 @@ op, for non-quantized types, has constraints like which constrained the signature of reduce op and its associated reducer function `body` to have the same element types for `inputs`, `results` and arguments and -return for `body`. For reducer function performing an accumulative operation like -add, this means that the the result of accumulation can overflow in which case -the result will be implementation defined (e.g., -[saturated](https://en.wikipedia.org/wiki/Saturation_arithmetic) or -[wrap around](https://en.wikipedia.org/wiki/Integer_overflow)). -From the conversation with customers it seems a reasonable behavior for non -quantized data types. However, with quantized data types, such loss in precision -is not acceptable and hence the motivation is to perform the accumulation in -some higher data type. - -The RFC highlights some of the options emerged out of discussion in the +return for `body`. For reducer function performing an accumulative operation +like add, this means that the the result of accumulation can overflow in which +case the result will be implementation defined (e.g., + [saturated](https://en.wikipedia.org/wiki/Saturation_arithmetic) or + [wrap around](https://en.wikipedia.org/wiki/Integer_overflow)). From +the conversation with customers it seems a reasonable behavior for non quantized +data types. However, with quantized data types, such loss in precision is not +acceptable and hence the motivation is to perform the accumulation in some +higher data type. + +The RFC introduces the following proposal, emerged out of discussion in the [thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) -along with their tradeoffs. The proposal option #1 looks promising at this -point, but we are open to further discussion on this. - -## Option 1: Introduce additional conversion functions - -[The thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) -discuses an option, proposed by @loganchien, on how to achieve the structural -changes as mentioned above. We note that some of the examples/diagrams presented -here are borrowed from an internal doc @loganchien authored. - -The proposed options introduces on-the-fly type conversions, which (1) convert -the input type to the type of the `body` function argument and (2) convert the -result type of the `body` function to the output type. Following is the code -snippet with the proposed syntax of reduce op: +, along with their tradeoffs. + +The proposal allows the reducer block to express the computation in a different +element type (preferably higher accumulation type) than the one used in reduce +op's ops arguments and return type. For illustrative purposes, in the following +example, the operand element type `tensor>` is different from the element type for + reduction region's block arguments. Similarly, the element type of the + reduce op's result `!quant.uniform>` is + different from that of block return (`tensor>`). 
```mlir %result = "stablehlo.reduce"(%input, %init_value) ({ - ^input_conversion( - %input: tensor>): - %input_rescaled = "stablehlo.uniform_quantize"(%input) - : (tensor>) - -> tensor> - "stablehlo.return"(%input_rescaled) - : (tensor>) -> () - - }, { ^reduce_computation( %lhs: tensor>, %rhs: tensor>): @@ -71,310 +62,63 @@ snippet with the proposed syntax of reduce op: -> tensor> "stablehlo.return"(%add) : (tensor>) -> () - }, { - ^output_conversion( - %intermediate_result: tensor>): - %output_rescaled = "stablehlo.uniform_quantize"(%intermediate_result) - : (tensor>) - -> tensor> - "stablehlo.return"(%output_rescaled) - : (tensor>) -> () }) { - dimensions = dense<...> : tensor<1xi64> - } : (tensor<... x !quant.uniform>, - tensor<... x !quant.uniform>) - -> tensor<... x !quant.uniform> + dimensions = dense<1> : tensor + } : (tensor<5 x 1 x !quant.uniform>, + tensor>) + -> tensor<5 x !quant.uniform> ``` ### Semantics -Here we will informally propose the semantics of the additional functions -`input_conversion` and `output_conversion` introduced. - -```python -+----------+ +--------+ +--------+ +----------+ +--------+ +--------+ -|init_value| |input[0]| |input[1]| |init_value| |input[2]| |input[3]| -+----------+ +--------+ +--------+ +----------+ +--------+ +--------+ - | | | | | | -+----------+ +--------+ +--------+ +----------+ +--------+ +--------+ -|input | |input | |input | |input | |input | |input | -|convert | |convert | |convert | |convert | |convert | |convert | -+----------+ +--------+ +--------+ +----------+ +--------+ +--------+ - \ / / \ / / - +-------+ / +-------+ / - |compute| / |compute| / - +-------+ / +-------+ / - \ / \ / - +-------+ +-------+ - |compute| |compute| - +-------+ +-------+ - \___________ ___________/ - \ / - +-------+ - |compute| - +-------+ - | - +-------+ - |output | - |convert| - +-------+ -``` - -### Semantics of `input_conversion` block - -The `input_conversion` block is applied selectively to the leaf nodes of a -schedule tree as shown in above diagram. Note that the `input_conversion` cannot -be applied to the non-leaf nodes of the schedule tree. - -### Semantics of `output_conversion` block - -The `output_conversion` block is applied just after the `result` for a particular -index is computed as shown in the above diagram. - -Please refer to the [formal spec](#revised-specification-of-reduce-op) of the proposed -reduce op. - -### Implementation details - -From the implementation POV of the proposed spec, we note that -`input_conversion` or `output_conversion` can very well be optional with -default values as identity functions. For example, the following code snippet - -```mlir -%result = "stablehlo.reduce"(%input, %init_value) ({ - ^reduce_computation( - %lhs: tensor>, - %rhs: tensor>): - %add = "stablehlo.add"(%lhs, %rhs) - : (tensor>, - tensor>) - -> tensor> - "stablehlo.return"(%add) - : (tensor>) -> () - }) { - dimensions = dense<...> : tensor<1xi64> - } : (tensor<... x !quant.uniform>, - tensor<... x !quant.uniform>) - -> tensor<... 
x !quant.uniform> -``` - -should be interpreted as - -```mlir -%result = "stablehlo.reduce"(%input, %init_value) ({ - ^input_conversion( - %input: tensor>): - "stablehlo.return"(%input) - : (tensor>) -> () - - }, { - ^reduce_computation( - %lhs: tensor>, - %rhs: tensor>): - %add = "stablehlo.add"(%lhs, %rhs) - : (tensor>, - tensor>) - -> tensor> - "stablehlo.return"(%add) - : (tensor>) -> () - }, { - ^output_conversion( - %intermediate_result: tensor>): - "stablehlo.return"(%intermediate_result) - : (tensor>) -> () - }) { - dimensions = dense<...> : tensor<1xi64> - } : (tensor<... x !quant.uniform>, - tensor<... x !quant.uniform>) - -> tensor<... x !quant.uniform> -``` - -Note that with default values, the input/result type of `reduce` op matches -with the argument or the result type of the `reduce_computation`, including the -quantization parameters. - -Also, note that the relative order of `input_conversion` or `output_conversion` -w.r.t the `reduce_computation` can be used to identify the appropriate -conversion function when any one of `input_conversion` or `output_conversion` is -missing. - -The existing pretty printing is currently producing the following output -`stablehlo.reduce(%input init: %init_value) applies stablehlo.add across -dimensions = [1] : (tensor<1x6xi64>, tensor) -> tensor<1xi64>`. IMO, -modifying the above format, with the default conversion function, will create -clutter. My proposal here is to follow the existing pretty printing when the -conversion functions are "not provided". In the event, the conversion functions -are explicitly provided, then the pretty printers will fall back to default -generic printing, -**even if the explicitly provided conversion functions are identity function**: -To avoid identification of identity functions which could be tricky in general. - -### Tradeoffs - -* (+) Enables programmers to program at (almost) baremetal. If the hardware - can support reduction computation in wider type (e.g. in the SIMD - instruction set, we typically do widening/compute/narrowing within the - kernel to save the memory bandwidth), the programmer can explicitly request - for that. -* (-) The disadvantage of this representation is that the syntax is more - verbose and requires significant changes to the specification. - -## Option 2: re-scale input to accumulation type - -This option is the simplest from the POV for specification of quantized `reduce` -op. This is adding `stablehlo.uniform_quantize`ops before and after reduce op -which operates on the "accumulator" type. - -```mlir -%widen = "stablehlo.uniform_quantize"(%input) - : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> - -%reduce = "stablehlo.reduce"(%widen) { - ^reduce_computation(%lhs: !quant.uniform, %rhs: !qunat.uniform): - // reduce_computation_block - } - : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> - -%narrowed = "stablehlo.uniform_quantize"(%reduce) - : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> -``` - -### Tradeoffs - -* (+) An advantage of this option is that we only need minor changes to the - specification (i.e. to allow quantized types). -* (-) The compiler must pattern match 3 operations and map them into some - internal representation before their compilation or execution. -* (-) The compiler must ensure that the `stablehlo.uniform_quantize` (or - `stablehlo.convert` in the case of `bf16` or `f16`) is not folded before the - backend matches the pattern. 
- [for more information](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) - -## Option 3: allow accumulator type to be different from input type - -This is another option we considered which does not fly well because of limited -expressibility. Adding it just for completeness purposes. -The idea here is to convey the accumulator type using the `init_value` operand -of `reduce` op. The code snippet for `reduce` looks like: - -```mlir -%result = "stablehlo.reduce"(%input, %init_value) ({ - ^reduce_computation( - %elem: tensor>, - %acc: tensor>): - %elem_rescaled = "stablehlo.uniform_quantize"(%elem) - : (tensor>) - -> tensor> - %add = "stablehlo.add"(%elem_rescaled, %acc) - : (tensor>, - tensor>) - -> tensor> - "stablehlo.return"(%0) - : (tensor>) -> () - }) { - dimensions = dense<1> : tensor<1xi64> - } : (tensor<... x !quant.uniform>, - tensor<... x !quant.uniform>) - -> tensor<... x !quant.uniform> -``` - -In this option, the `init_value` type and the `result` type can be different -from the input type. The first argument of the compute block is fixed for the -traversed element and the second argument is fixed for the intermediate -(accumulation) result. - -### Tradeoffs - -* (+) Make the accumulation type explicit in the IR. -* (-) This representation imposes a limitation on the evaluation order. - Since we can’t express the computation between two intermediate (accumulation) - results, we can not arbitrarily insert `init_value` and start the - computation at an arbitrary location. The following shows the restricted - evaluation order with the method. - -```python -+----------+ +--------+ +--------+ +--------+ +--------+ -|init_value| |input[0]| |input[1]| |input[2]| |input[3]| -+----------+ +--------+ +--------+ +--------+ +--------+ - \ / / / / - +-------+ / / / - |compute| / / / - +-------+ / / / - \ / / / - +-------+ / / - |compute| / / - +-------+ / / - \ / / - +-------+ / - |compute| / - +-------+ / - \ / - +-------+ - |compute| - +-------+ -``` - -## Open Question - -### Should we restrict the proposal #1 to quantized types only? - -The above proposal #1 of introducing the additional functions is theoretically -not limited to quantized `reduce` op, but also can be applied to `reduce` op with -non-quantized types. For example, - -```mlir -%result = "stablehlo.reduce"(%input, %init_value) ({ - ^input_conversion(%arg0: tensor): - %0 = "stablehlo.convert"(%arg0): (tensor) -> (tensor) - "stablehlo.return"(%0) : (tensor) -> (tensor) - }, { - ^bb0(%arg0: tensor, %arg1: tensor): - %0 = "stablehlo.add"(%arg0, %arg1) : (tensor, tensor) -> - tensor - "stablehlo.return"(%0) : (tensor) -> () - }, { - ^output_conversion(%arg0: tensor): - %0 = "stablehlo.convert"(%arg0): (tensor) -> (tensor) - "stablehlo.return"(%0) : (tensor) -> (tensor) - }) { - dimensions = dense<1> : tensor<1xbf16> -} : (tensor<1x6xbf16>, tensor) -> tensor<1xbf16> -``` - -However, it is not clear how such operations will be lowered to other IR -representations, like HLO, which does not support such additional computation -blocks. IMO there is no additional benefit to support such conversion -functions for regular type given that there already exists infrastructure -(backend support, lowering passes) to support regular types w/o conversion -functions. My proposal here would be to restrict the support to only quantized -types. 
+Depending on (1) the input operand type is different from the reduction block +argument type or (2) the op result type is different from the reduction block +return type, there will be implicit type conversion defined by either +`stablehlo.convert`, `stablehlo.uniform_quantize`, or +`stablehlo.uniform_dequantize`. For example, + + | Implicit type conversion op | element type of operand or result type | element type of block argument or block return type | + |-----------------------------------|----------------------------------------|-----------------------------------------------------| + | (A) `stablehlo.uniorm_quantize` | quantized tensor | quantized tensor | + | (B) `stablehlo.uniorm_quantize` | floating point | quantized tensor | + | (C) `stablehlo.uniorm_dequantize` | quantized tensor | floating point | + | (D) `stablehlo.convert` | floating-point | integer | + | (E) `stablehlo.convert` | integer | floating-point | + | (F) `stablehlo.convert` | floating-point | floating-point | + | (G) `stablehlo.convert` | integer | integer | + | (G) `stablehlo.convert` | complex | complex | + +At this point there is no use for cases other than (A), (F), and (G). My +proposal here would be to address (A), (F), and (G) only. Note that the (F) + partially addresses [Decide on mixed + precision](https://github.com/openxla/stablehlo/issues/369) for reduce op in + that it allows the the input or init value to differ from the corresponding + block arguments w.r.t the precision of floating-point types. However, the + mixed precision implementation in HLO seems more detailed in the sense that + even allows `inputs` and `init_values` to differ in floating-point + precision. My proposal would be to treat the above ticket separately. ## Appendix -To provide an estimate of specification changes needed to implement option #1 -I have attempted to provide the blueprint here. +To provide an estimate of specification changes needed to implement the +proposal, I have attempted to provide the blueprint here. ### Revised specification of reduce op -#### Semantics +Here we include only the relevant portions of the spec with the proposed update. -Applies a reduction functions `input_conversion`, `body`, and -`output_conversion` to `inputs` and `init_values` along the `dimensions` and -produces `results` tensors. +#### Semantics -The order of reductions is implementation-defined, which means that `body` and -`init_values` must form a monoid to guarantee that the operation produces the -same results for all inputs on all implementations. However, this condition -doesn't hold for many popular reductions. E.g. floating-point addition for -`body` and zero for `init_values` don't actually form a monoid because -floating-point addition is not associative. +... More formally, `results...[j0, ..., jR-1] = -map(output_conversion, reduce(input_slices_converted))` where: +reduce_implicit_convert(reduce(input_slices_converted), + type(func_outputs(body)...), type(results...)))` where: * `input_slices = inputs...[j0, ..., :, ..., jR-1]`, where `:` are inserted at `dimensions`. -* `input_slices_converted = map(input_conversion, input_slices...)`. +* `input_slices_converted = reduce_implicit_convert(input_slices..., + type(inputs...), type(func_inputs(body)...)`. * `reduce(input_slices_converted) = exec(schedule)` for some binary tree `schedule` where: * `exec(node) = body(exec(node.left), exec(node.right))`. 
@@ -384,89 +128,52 @@ map(output_conversion, reduce(input_slices_converted))` where: * `input_slices_converted...[index]` values, for all `index` in `index_space(input_slices_converted)` in the ascending lexicographic order of `index`. - * Interspersed with an implementation-defined amount of `init_values` + * Interspersed with an implementation-defined amount of + `reduce_implicit_convert(init_values..., type(init_values...), type(func_inputs(body)[:len(func_inputs(body)//2)])...)` at implementation-defined positions. -#### Inputs - -| Label | Name | Type | Constraints | -|-------|---------------------|----------------------------------------------|-------------| -| (I?) | `inputs` | variadic number of tensors | | -| (I?) | `init_values` | variadic number of 0-dimensional tensors | | -| (I?) | `dimensions` | 1-dimensional tensor constant of type `si64` | | -| (I?) | `input_conversion` | function | | -| (I?) | `body` | function | | -| (I?) | `output_conversion` | function | | - -#### Outputs - -| Name | Type | Constraints | -|-----------|----------------------------|-------------| -| `results` | variadic number of tensors | | - #### Constraints * (C?) `same(shape(inputs...))`. * (C?) `element_type(inputs...) = element_type(init_values...)`. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. -* (C?) `0 < size(inputs) = size(init_values) = size(results) = N`. -* (C?) `0 <= dimensions < rank(inputs[0])`. -* (C?) `is_unique(dimensions)`. -* (C?) `input_conversion` has type `tensor, ..., tensor -> - (tensor, ..., tensor)` where `Ei = element_type(inputs[i])`. * (C?) `body` has type `tensor, ..., tensor, tensor, ...,` `tensor) -> (tensor, ..., tensor)` where - `Ei = element_type(output_types(input_conversion)[i])`. -* (C?) `output_conversion` has type `tensor, ..., tensor -> - (tensor, ..., tensor)` where - `E'i = element_type(results[i])`. -* (C?) `element_type(output_types(input_conversion)...) = - element_type(input_types(output_conversion)...)`. + `is_integer(element_type(inputs[i])) = is_integer(element_type(Ei]` or + `is_float(element_type(inputs[i])) = is_float(element_type(Ei]` or + `is_complex(element_type(inputs[i])) = is_complex(element_type(Ei]` or + `is_quantized(element_type(inputs[i])) = is_quantized(element_type(Ei]`. * (C?) `shape(results...) = shape(inputs...)` except that the dimension sizes of `inputs...` corresponding to `dimensions` are not included. +`reduce_implicit_convert` is defined as + +```python +def reduce_implicit_convert(x: Value, source_type: Type, destination_type: + Type): + if source_type == destination_type: + return x + if is_quantized(source_type) and is_quantized(destination_type): + return quantize(x, destination_type) + return convert(x, destination_type) +``` + The above specification of `reduce` op can be used to define the specification -of other ops as shown below. For brevity, we are only presenting the relevant +of other ops as shown below. As before, we are only presenting the relevant portions of the spec which needs modification. ### Revised specification of reduce_window op -#### Semantics - -Applies a reduction functions `input_conversion`, `body`, and -`output_conversion` to windows of `inputs` and `init_values` and produces -`results`. - -... - -More formally, -`results...[result_index] = reduce(windows, init_values, axes(inputs...), - input_conversion, body, output_conversion)` -where: -.... - -#### Inputs - -| Label | Name | Type | -|-------|---------------------|----------| -| (I?) 
| `input_conversion` | function | -| (I8) | `body` | function | -| (I?) | `output_conversion` | function | - #### Constraints * (C?) `element_type(inputs...) = element_type(init_values...)`. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. -* (C?) `input_conversion` has type `tensor, ..., tensor -> - (tensor, ..., tensor)` where `Ei = element_type(inputs[i])`. * (C?) `body` has type `tensor, ..., tensor, tensor, ...,` `tensor) -> (tensor, ..., tensor)` where - `Ei = element_type(output_types(input_conversion)[i])`. -* (C?) `output_conversion` has type `tensor, ..., tensor -> - (tensor, ..., tensor)` where - `E'i = element_type(results[i])`. -* (C?) `element_type(output_types(input_conversion)...) = - element_type(input_types(output_conversion)...)`. + `is_integer(element_type(inputs[i])) = is_integer(element_type(Ei]` or + `is_float(element_type(inputs[i])) = is_float(element_type(Ei]` or + `is_complex(element_type(inputs[i])) = is_complex(element_type(Ei]` or + `is_quantized(element_type(inputs[i])) = is_quantized(element_type(Ei]`. ### Revised specification of select_and_scatter op @@ -476,74 +183,171 @@ not need additional conversion functions associated with `select`. But the `scatter` function needs be accompanied with `input_conversion` and `output_conversion` functions. -#### Semantics - -Scatters the values from the `source` tensor using `scatter` based on the -outcome of `reduce_window` of the `input` tensor using `select` and produces -a `result` tensor. - -More formally: -... - -* `result[result_index] = reduce([source_values], [init_value], [0], - input_conversion, scatter, output_conversion)` - where: - ... - -#### Inputs - -| Label | Name | Type | -|-------|---------------------|----------| -| (I8) | `input_conversion` | function | -| (I8) | `scatter` | function | -| (I8) | `output_conversion` | function | - #### Constraints * (C1) `element_type(operand) = element_type(source)`. * (C3) `element_type(init_value) = element_type(operand)`. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. -* (C?) `input_conversion` has type `tensor -> (tensor)` where - `Ei = element_type(operand)`. * (C10) `scatter` has type `(tensor, tensor) -> tensor` where - `E = element_type(output_types(input_conversion))`. -* (C?) `output_conversion` has type `tensor -> (tensor)` where - `E'i = element_type(result)`. -* (C?) `element_type(output_types(input_conversion)) = - element_type(input_types(output_conversion))`. -* (C11) `shape(operand) = shape(result)`. + `is_integer(element_type(operand)) = is_integer(element_type(E]` or + `is_float(element_type(operand)) = is_float(element_type(E]` or + `is_complex(element_type(operand)) = is_complex(element_type(E]` or + `is_quantized(element_type(operand)) = is_quantized(element_type(E]`. -## [11 Aug'23] Revised proposal +### Action Plan + +I propose to follow the action plan (order matters): + +* Update the specification of ReduceOp, ReduceWindowOp, and SelectAndScatterOp + op, taking the accumulation type into account, via [open + pr](https://github.com/openxla/stablehlo/pull/1538). +* Finalize the quantized specification of AllReduceOp, BatchNormTrainingOp, + BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp, + via [open ticket](https://github.com/openxla/stablehlo/issues/1666). +* Spec the behavior of `precision_config` in DotGeneralOp. [open +issue](https://github.com/openxla/stablehlo/issues/755) +* Consider adding `precision_config` in reduction op. 
`precision_config`, +currently used for `dot_general` and `convolution`, to override the precision +specified by the input parameters, allowing the choice of low precision vs high +precision computation. We should consider adding `precision_config` to all +reduction based op as well. [need a ticket for this] +* Consider adding `accumulation_type` to `dot_general`/`convolution op`. The +attribute seems beneficial for ops like `dot_general` and `convolution` which +does not have an explicit reduction function. [need a ticket for this item]. + +## Summary of previous proposals + +For completeness of the presentation, let me provide the proposals which are +evaluated previously and help shape the current proposal. + +### Re-scale input to accumulation type + +This option is the simplest from the POV for specification of quantized `reduce` +op. This is adding `stablehlo.uniform_quantize`ops before and after reduce op +which operates on the "accumulator" type. + +```mlir +%widen = "stablehlo.uniform_quantize"(%input) + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> + +%reduce = "stablehlo.reduce"(%widen) { + ^reduce_computation(%lhs: !quant.uniform, %rhs: !qunat.uniform): + // reduce_computation_block + } + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> + +%narrowed = "stablehlo.uniform_quantize"(%reduce) + : (tensor<... x !quant.uniform>) -> tensor<... x !quant.uniform> +``` + +#### Tradeoffs + +* (+) An advantage of this option is that we only need minor changes to the + specification (i.e. to allow quantized types). +* (-) The compiler must pattern match 3 operations and map them into some + internal representation before their compilation or execution. +* (-) The compiler must ensure that the `stablehlo.uniform_quantize` (or + `stablehlo.convert` in the case of `bf16` or `f16`) is not folded before the + backend matches the pattern. + [for more information](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906) + +This proposal should be avoided because it is hard to control the transformation +which might disrupt the pattern to be matched. + +### Introduce on-the-fly type conversions + +Proposes addition two regions in reduce op to (1) convert the input type to the +type of the `body` function argument and (2) convert the result type of the +`body` function to the output type. Following is the code snippet with the +proposed syntax of reduce op: + +```mlir +%result = "stablehlo.reduce"(%input, %init_value) ({ + ^input_conversion( + %input: tensor>): + %input_rescaled = "stablehlo.uniform_quantize"(%input) + : (tensor>) + -> tensor> + "stablehlo.return"(%input_rescaled) + : (tensor>) -> () -### Context + }, { + ^reduce_computation( + %lhs: tensor>, + %rhs: tensor>): + %add = "stablehlo.add"(%lhs, %rhs) + : (tensor>, + tensor>) + -> tensor> + "stablehlo.return"(%add) + : (tensor>) -> () + }, { + ^output_conversion( + %intermediate_result: tensor>): + %output_rescaled = "stablehlo.uniform_quantize"(%intermediate_result) + : (tensor>) + -> tensor> + "stablehlo.return"(%output_rescaled) + : (tensor>) -> () + }) { + dimensions = dense<...> : tensor<1xi64> + } : (tensor<... x !quant.uniform>, + tensor<... x !quant.uniform>) + -> tensor<... x !quant.uniform> +``` -Option #2 should be avoided because it is hard to control the transformation -which might disrupt the pattern to be matched. The option #1 sounds good except -that the extra input/output conversion blocks are surplus information. 
The -specification would benefit if the intent of the conversion blocks can be -expressed precisely. The conversion blocks provides a way to capture the -accumulation type needed to compute the accumulative operation on. +Here we will informally propose the semantics of the additional functions +`input_conversion` and `output_conversion` introduced. -The revised proposal is: +```python ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ +|init_value| |input[0]| |input[1]| |init_value| |input[2]| |input[3]| ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ + | | | | | | ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ +|input | |input | |input | |input | |input | |input | +|convert | |convert | |convert | |convert | |convert | |convert | ++----------+ +--------+ +--------+ +----------+ +--------+ +--------+ + \ / / \ / / + +-------+ / +-------+ / + |compute| / |compute| / + +-------+ / +-------+ / + \ / \ / + +-------+ +-------+ + |compute| |compute| + +-------+ +-------+ + \___________ ___________/ + \ / + +-------+ + |compute| + +-------+ + | + +-------+ + |output | + |convert| + +-------+ +``` -* To capture the accumulation type via an additional StableHLO attribute like - `accumulation_element_type`. -* The attribute seems beneficial for other ops as well like `dot_general` and - `convolution`. -* `precision_config`, currently used for `dot_general` and `convolution`, is - used to override the precision specified by the input parameters, allowing the - choice of low precision vs high precision computation. We should consider - adding `precision_config` to all reduction based op as well. +### Tradeoffs -### Few implementation details +* (+) Enables programmers to program at (almost) baremetal. If the hardware + can support reduction computation in wider type (e.g. in the SIMD + instruction set, we typically do widening/compute/narrowing within the + kernel to save the memory bandwidth), the programmer can explicitly request + for that. +* (-) The disadvantage of this representation is that the syntax is more + verbose and requires significant changes to the specification. +* (-) The extra input/output conversion blocks are surplus information. The +intent of conversion blocks is to capture the accumulation type needed to +compute the accumulative operation on. The specification would benefit if the +intent can be expressed succinctly. -#### On StableHLO side +### Introduce accumulation type attribute -The reduce syntax to be augmented with a optional [type -attribute](https://github.com/llvm/llvm-project/blob/51a57074bc63842970c4c160b05c1a7e42db7523/mlir/include/mlir/IR/OpBase.td#L1466) -as follows: +Instead of using additional input and output conversion blocks, use a type +attribute `accumulation type` to capture the accumulation type. As an example, ```mlir %0 = stablehlo.reduce(%arg0 init: %arg1) across dimensions = [0] { @@ -558,48 +362,11 @@ as follows: // OptionalAttr>:$accumulation_type ``` -Note that the main difference between this option and the option #1 is that the -input and output conversion blocks are no longer used as their intent is -specified via the `accumulation_type` attribute. However, the reducer block -still needs to express the computation in accumulation type only. - -**Why optional attribute?** - -* At times, it might be desirable not to hard-code the accumulation type. 
For - example, when we would like to write a generic code and let the downstream - compilation tools to decide the exact accumulation type based on the hardware - of choice. -* It allows the stablehlo, used in various existing pipelines, to remain - largely unaffected by this change. - -Next, the StableHLO specification should be updated with the syntax and -semantics aspects of this attribute. - -#### On StableHLO Consumers side - -The consumers can pattern match the op taking the accumulation type in account -if the targeted hardware supports accumulation at higher type. -There are still to explore things about maintaining StableHLO-HLO parity which -needs to be addresses as well. - -### Action Plan - -I propose to follow the action plan (order matters): +Note that the main difference between this option and the previous option is +that the input and output conversion blocks are no longer used and their intent +is specified via the `accumulation_type` attribute. However, the reducer block +needs to express the computation in accumulation type only. -* Update the specification of ReduceOp, ReduceWindowOp, and SelectAndScatterOp - op, taking the accumulation type into account, via [open - pr](https://github.com/openxla/stablehlo/pull/1538). -* Finalize the quantized specification of AllReduceOp, BatchNormTrainingOp, - BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp, - via [open ticket](https://github.com/openxla/stablehlo/issues/1666). -* Add implementation for additional attribute in the above ops. This includes -updating the tablegen spec/verifiers/type inferencers. [Need a ticket for this]. -* Address the disparity between StableHLO and HLO because of the introduction of -this new attribute in StableHLO: Should/How XLA should consume this additional -attribute? [Need a ticket for this]. -* Spec the behavior of `precision_config` in DotGeneralOp. [open -issue](https://github.com/openxla/stablehlo/issues/755) -* Consider adding `precision_config` in reduction op. [need a ticket for this -* Consider adding `accumulation_type` to `dot_general`/`convolution op`. -[need a ticket for this item]. -item]. +This options is discarded because, for reduce op, the additional attribute seems +redundant and can be inferred based on the differences in element type of +operand and reduction block arguments (as described in the current proposal). From 538cb8f5bc0a683da203e0c560b14303ebbb6ce6 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Tue, 29 Aug 2023 20:46:53 +0000 Subject: [PATCH 08/11] review iteration: various typo fixes --- rfcs/20230622-quantized-reduction.md | 96 ++++++++++++++-------------- 1 file changed, 48 insertions(+), 48 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 248d97c097a..3ec7a8a26af 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -42,14 +42,14 @@ The RFC introduces the following proposal, emerged out of discussion in the , along with their tradeoffs. The proposal allows the reducer block to express the computation in a different -element type (preferably higher accumulation type) than the one used in reduce +element type (preferably wider accumulation type) than the one used in reduce op's ops arguments and return type. For illustrative purposes, in the following -example, the operand element type `tensor>` is different from the element type for - reduction region's block arguments. 
Similarly, the element type of the - reduce op's result `!quant.uniform>` is - different from that of block return (`tensor>`). +example, the operand element type +`tensor>` is different from the +element type for reduction region's block arguments. Similarly, the element +type of the reduce op's result +`!quant.uniform>` is different from that of +block return (`tensor>`). ```mlir %result = "stablehlo.reduce"(%input, %init_value) ({ @@ -71,32 +71,32 @@ example, the operand element type `tensor, ..., tensor, tensor, ...,` `tensor) -> (tensor, ..., tensor)` where - `is_integer(element_type(inputs[i])) = is_integer(element_type(Ei]` or - `is_float(element_type(inputs[i])) = is_float(element_type(Ei]` or - `is_complex(element_type(inputs[i])) = is_complex(element_type(Ei]` or - `is_quantized(element_type(inputs[i])) = is_quantized(element_type(Ei]`. + `is_integer(element_type(inputs[i])) = is_integer(element_type(E[i]))` or + `is_float(element_type(inputs[i])) = is_float(element_type(E[i]))` or + `is_complex(element_type(inputs[i])) = is_complex(element_type(E[i]))` or + `is_quantized(element_type(inputs[i])) = is_quantized(element_type(E[i]))`. * (C?) `shape(results...) = shape(inputs...)` except that the dimension sizes of `inputs...` corresponding to `dimensions` are not included. @@ -170,10 +170,10 @@ portions of the spec which needs modification. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. * (C?) `body` has type `tensor, ..., tensor, tensor, ...,` `tensor) -> (tensor, ..., tensor)` where - `is_integer(element_type(inputs[i])) = is_integer(element_type(Ei]` or - `is_float(element_type(inputs[i])) = is_float(element_type(Ei]` or - `is_complex(element_type(inputs[i])) = is_complex(element_type(Ei]` or - `is_quantized(element_type(inputs[i])) = is_quantized(element_type(Ei]`. + `is_integer(element_type(inputs[i])) = is_integer(element_type(E[i]))` or + `is_float(element_type(inputs[i])) = is_float(element_type(E[i]))` or + `is_complex(element_type(inputs[i])) = is_complex(element_type(E[i]))` or + `is_quantized(element_type(inputs[i])) = is_quantized(element_type(E[i]))`. ### Revised specification of select_and_scatter op @@ -190,10 +190,10 @@ not need additional conversion functions associated with `select`. But the * (C3) `element_type(init_value) = element_type(operand)`. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. * (C10) `scatter` has type `(tensor, tensor) -> tensor` where - `is_integer(element_type(operand)) = is_integer(element_type(E]` or - `is_float(element_type(operand)) = is_float(element_type(E]` or - `is_complex(element_type(operand)) = is_complex(element_type(E]` or - `is_quantized(element_type(operand)) = is_quantized(element_type(E]`. + `is_integer(element_type(operand)) = is_integer(element_type(E))` or + `is_float(element_type(operand)) = is_float(element_type(E))` or + `is_complex(element_type(operand)) = is_complex(element_type(E))` or + `is_quantized(element_type(operand)) = is_quantized(element_type(E))`. ### Action Plan @@ -204,18 +204,18 @@ I propose to follow the action plan (order matters): op, taking the accumulation type into account, via [open pr](https://github.com/openxla/stablehlo/pull/1538). * Finalize the quantized specification of AllReduceOp, BatchNormTrainingOp, - BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp, - via [open ticket](https://github.com/openxla/stablehlo/issues/1666). 
+ BatchNormGradOp and ReduceScatterOp, whose semantics depend on ReduceOp, + via [open ticket](https://github.com/openxla/stablehlo/issues/1666). * Spec the behavior of `precision_config` in DotGeneralOp. [open issue](https://github.com/openxla/stablehlo/issues/755) * Consider adding `precision_config` in reduction op. `precision_config`, -currently used for `dot_general` and `convolution`, to override the precision -specified by the input parameters, allowing the choice of low precision vs high -precision computation. We should consider adding `precision_config` to all -reduction based op as well. [need a ticket for this] + currently used for `dot_general` and `convolution`, to override the precision + specified by the input parameters, allowing the choice of low precision vs + high precision computation. We should consider adding `precision_config` to + all reduction based op as well. [need a ticket for this] * Consider adding `accumulation_type` to `dot_general`/`convolution op`. The -attribute seems beneficial for ops like `dot_general` and `convolution` which -does not have an explicit reduction function. [need a ticket for this item]. + attribute seems beneficial for ops like `dot_general` and `convolution` which + does not have an explicit reduction function. [need a ticket for this item]. ## Summary of previous proposals @@ -340,9 +340,9 @@ Here we will informally propose the semantics of the additional functions * (-) The disadvantage of this representation is that the syntax is more verbose and requires significant changes to the specification. * (-) The extra input/output conversion blocks are surplus information. The -intent of conversion blocks is to capture the accumulation type needed to -compute the accumulative operation on. The specification would benefit if the -intent can be expressed succinctly. + intent of conversion blocks is to capture the accumulation type needed to + compute the accumulative operation on. The specification would benefit if the + intent can be expressed succinctly. ### Introduce accumulation type attribute From 1a0eabd8c30283934c766c6e93d93901a976b95d Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Mon, 11 Sep 2023 18:25:37 +0000 Subject: [PATCH 09/11] elaborate on the why #369 is not entirely part of the current proposal --- rfcs/20230622-quantized-reduction.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 3ec7a8a26af..e0418cddb52 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -95,8 +95,18 @@ partially addresses for reduce op in that it allows the input or init value to differ from the corresponding block arguments w.r.t the precision of floating-point types. However, the mixed precision implementation in HLO seems more detailed in the -sense that even allows `inputs` and `init_values` to differ in floating-point -precision. My proposal would be to treat the above ticket separately. +following sense: + +* [Decide on mixed precision](https://github.com/openxla/stablehlo/issues/369) +allows `inputs` and `init_values` to differ in floating-point precision. +Whereas, the current proposal considers them to have the same element type. +* [Decide on mixed precision](https://github.com/openxla/stablehlo/issues/369) +allows the element type of block arguments to differ from that of the block +return value. The current proposal considers them to have the same element type. 
+* There are other ops (than reduce) which need support for mixed precision (here
+is the [list of ops](https://github.com/tensorflow/tensorflow/blob/1d69ba72834b963b72075a82c10959f6bb74e473/tensorflow/compiler/xla/service/hlo_verifier.cc#L1681-L1714)).
+
+Having said that, my proposal would be to treat the above ticket separately.
 
 ## Appendix
 

From b5595da2b4aa9fcdb35acd5d1c991f6764fa598e Mon Sep 17 00:00:00 2001
From: Sandeep Dasgupta 
Date: Wed, 4 Oct 2023 22:27:24 +0000
Subject: [PATCH 10/11] make it clear in the rfc that the proposed semantics is
 optional

---
 rfcs/20230622-quantized-reduction.md | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md
index e0418cddb52..739b80eec57 100644
--- a/rfcs/20230622-quantized-reduction.md
+++ b/rfcs/20230622-quantized-reduction.md
@@ -41,15 +41,14 @@ The RFC introduces the following proposal, emerged out of discussion in the
 [thread](https://github.com/openxla/stablehlo/pull/1538#issuecomment-1599476906)
 , along with their tradeoffs.
 
-The proposal allows the reducer block to express the computation in a different
-element type (preferably wider accumulation type) than the one used in reduce
-op's ops arguments and return type. For illustrative purposes, in the following
-example, the operand element type
-`tensor>` is different from the
-element type for reduction region's block arguments. Similarly, the element
-type of the reduce op's result
-`!quant.uniform>` is different from that of
-block return (`tensor>`).
+The proposal optionally allows the reducer block to express the computation in a
+different element type (preferably wider accumulation type) than the one used in
+reduce op's arguments and return type. For illustrative purposes, in the
+following example, the operand element type `tensor>` is different from the element type for reduction
+region's block arguments. Similarly, the element type of the reduce op's
+result `!quant.uniform>` is different from
+that of block return (`tensor>`).
 
 ```mlir
 %result = "stablehlo.reduce"(%input, %init_value) ({
@@ -71,11 +70,14 @@ block return (`tensor>`).
 
 ### Semantics
 
-If (1) the input operand type is different from the reduction block
-argument type or (2) the op result type is different from the reduction block
-return type, there will be implicit type conversion defined by either
+If (1) the input operand type is different from the reduction block argument
+type or (2) the op result type is different from the reduction block return
+type, there will be implicit type conversion defined by either
 `stablehlo.convert`, `stablehlo.uniform_quantize`, or
-`stablehlo.uniform_dequantize`. For example,
+`stablehlo.uniform_dequantize`. When the types are not different, i.e., when (1)
+and (2) do not hold, no implicit conversion is needed.
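+
+To make the placement of these implicit conversions concrete, the following is a
+minimal, non-normative Python sketch of a 1-dimensional reduction, reusing the
+`reduce_implicit_convert` helper from the revised reduce specification. The
+`reduce_1d` name and the `input_type`/`block_type`/`result_type` parameters are
+illustrative only, and the implementation-defined ordering and interspersing of
+`init_value` is ignored.
+
+```python
+def reduce_1d(elements, init_value, body, input_type, block_type, result_type):
+  # Widen the init value and every input element to the reducer block's
+  # element type before the block sees them.
+  acc = reduce_implicit_convert(init_value, input_type, block_type)
+  for x in elements:
+    acc = body(acc, reduce_implicit_convert(x, input_type, block_type))
+  # Narrow the accumulated value once, to the op's result element type.
+  return reduce_implicit_convert(acc, block_type, result_type)
+```
+
+When the element types coincide, `reduce_implicit_convert` degenerates to the
+identity, which matches the "no implicit conversion" case described above.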
+ +For example, | Implicit type conversion op | element type of operand or block return | element type of block argument or op return | |-----------------------------------|-----------------------------------------|---------------------------------------------| From 4cb20669bacdd7634d16b622a6cc132e475c4079 Mon Sep 17 00:00:00 2001 From: Sandeep Dasgupta Date: Mon, 9 Oct 2023 16:04:41 +0000 Subject: [PATCH 11/11] typo fixes --- rfcs/20230622-quantized-reduction.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfcs/20230622-quantized-reduction.md b/rfcs/20230622-quantized-reduction.md index 739b80eec57..678390ec4db 100644 --- a/rfcs/20230622-quantized-reduction.md +++ b/rfcs/20230622-quantized-reduction.md @@ -72,7 +72,7 @@ that of block return (`tensor>`). If (1) the input operand type is different from the reduction block argument type or (2) the op result type is different from the reduction block return -type, there will be implicit type conversion defined by either +type, there will be an implicit type conversion defined by either `stablehlo.convert`, `stablehlo.uniform_quantize`, or `stablehlo.uniform_dequantize`. When the types are not differnet, i.e., when (1) and (2) does not hold true, then no implicit convertion is needed. @@ -180,7 +180,7 @@ portions of the spec which needs modification. * (C?) `element_type(inputs...) = element_type(init_values...)`. * (C?) `baseline_element_type(inputs...) = baseline_element_type(results...)`. -* (C?) `body` has type `tensor, ..., tensor, tensor, ...,` +* (C?) `body` has type `(tensor, ..., tensor, tensor, ...,` `tensor) -> (tensor, ..., tensor)` where `is_integer(element_type(inputs[i])) = is_integer(element_type(E[i]))` or `is_float(element_type(inputs[i])) = is_float(element_type(E[i]))` or