From 56c9bcba365ba57442f9cac14048193bc9bcf764 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 09:00:58 -0700
Subject: [PATCH 01/14] [SYCL][Doc] Update sub-group extension docs

Splits sub-group functionality into three extensions:
- SubGroup (sub_group class and device queries)
- SubGroupAlgorithms (GroupAlgorithm support and permute)
- GroupMask (sub_group::mask_type and ballot)

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/GroupMask/README.md       |   3 +
 .../GroupMask/SYCL_INTEL_group_mask.asciidoc  | 244 +++++++++++++++
 sycl/doc/extensions/SubGroup/README.md        |   4 +
 .../SubGroup/SYCL_INTEL_sub_group.asciidoc    | 251 +++++++++++++++
 .../extensions/SubGroupAlgorithms/README.md   |   3 +
 .../SYCL_INTEL_sub_group_algorithms.asciidoc  | 177 +++++++++++
 .../SubGroupNDRange/SubGroupNDRange.md        | 286 ------------------
 7 files changed, 682 insertions(+), 286 deletions(-)
 create mode 100644 sycl/doc/extensions/GroupMask/README.md
 create mode 100755 sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
 create mode 100644 sycl/doc/extensions/SubGroup/README.md
 create mode 100755 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
 create mode 100644 sycl/doc/extensions/SubGroupAlgorithms/README.md
 create mode 100755 sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
 delete mode 100644 sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md

diff --git a/sycl/doc/extensions/GroupMask/README.md b/sycl/doc/extensions/GroupMask/README.md
new file mode 100644
index 0000000000000..19b93dccd8b0c
--- /dev/null
+++ b/sycl/doc/extensions/GroupMask/README.md
@@ -0,0 +1,3 @@
+# SYCL_INTEL_group_mask
+
+A new `group_mask` class providing an ability to efficiently represent subsets of work-items in a group for which a given Boolean condition holds.
diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
new file mode 100755
index 0000000000000..0a1cc753ae60c
--- /dev/null
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -0,0 +1,244 @@
+= SYCL_INTEL_group_mask
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+
+:blank: pass:[ +]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes.  This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+== Introduction
+IMPORTANT: This specification is a draft.
+
+NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos.
+
+NOTE: This document is better viewed when rendered as html with asciidoctor.  GitHub does not render image icons.
+
+This document describes an extension which adds a +group_mask+ type.  Such a mask can be used to efficiently represent subsets of work-items in a group for which a given Boolean condition holds.  Group mask functionality is currently limited to groups that are instances of the +sub_group+ class.
+
+== Name Strings
+
++SYCL_INTEL_group_mask+
+
+== Notice
+
+Copyright (c) 2020 Intel Corporation.  All rights reserved.
+
+== Status
+
+Working Draft
+
+This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension.
+
+Because the interfaces defined by this specification are not final and are subject to change they are not intended to be used by shipping software products.
+
+== Version
+
+Built On: {docdate} +
+Revision: 1
+
+== Contact
+John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
+
+== Dependencies
+
+This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6 and the following extensions:
+
+- +SYCL_INTEL_sub_group+
+
+== Overview
+
+A group mask is an integral type sized such that each work-item in the group is represented by a single bit. Such a mask can be used to efficiently represent subsets of work-items in a group for which a given Boolean condition holds.
+
+Group mask functionality is currently limited to groups that are instances of the +sub_group+ class, but this limitation may be lifted in a future version of the specification.
+
+=== Ballot
+
+The +ballot+ algorithm converts a Boolean condition from each work-item in the group into a group mask.  Like other group algorithms, +ballot+ must be encountered by all work-items in the group in converged control flow.
+
+|===
+|Member Functions|Description
+
+|+template <typename Group> Group::mask_type ballot(bool predicate = true) const+
+|Return a +group_mask+ representing the set of work-items in the group for which _predicate_ is +true+.
+|===
+
+=== Group Masks
+
+The group mask type is an opaque type, permitting implementations to use any mask representation subject to the following restrictions:
+
+- The size and alignment of the mask type must be the same on the host and device
+- A SYCL implementation supporting OpenCL interoperability must use a 128-bit mask convertible to a +vec<uint,4>+
+
+Functions declared in the +mask+ class can be called independently by different work-items in the same group.  An instance of a group class (e.g. +group+ or +sub_group+) is not required to manipulate a group mask.
+
+The mask is defined such that the least significant bit (LSB) corresponds to the work-item with id 0, and the most significant bit (MSB) corresponds to the work-item with the id +max_local_range()-1+.
+
+|===
+|Member Function|Description
+
+|+bool operator[](id<1> id) const+
+|Return +true+ if the bit corresponding to the specified _id_ is set in the mask.
+
+|+mask::reference operator[](id<1> id) const+
+|Return a reference to the bit corresponding to the specified _id_ in the mask.
+
+|+bool test(id<1> id) const+
+|Return +true+ if the bit corresponding to the specified _id_ is set in the mask.
+
+|+bool all() const+
+|Return +true+ if all bits in the mask are set.
+
+|+bool any() const+
+|Return +true+ if any bits in the mask are set.
+
+|+bool none() const+
+|Return +true+ if none of the bits in the mask are set.
+
+|+uint32_t count() const+
+|Return the number of bits set in the mask.
+
+|+uint32_t size() const+
+|Return the number of bits in the mask.
+
+|+id<1> find_low() const+
+|Return the lowest +id+ with a corresponding bit set in the mask. If no bits are set, the return value is equal to `size()`.
+
+|+id<1> find_high() const+
+|Return the highest +id+ with a corresponding bit set in the mask. If no bits are set, the return value is equal to `size()`.
+
+|+template <typename T = vec<uint32_t,4>> void insert_bits(T bits, id<1> pos = 0)+
+|Insert `CHAR_BIT * sizeof(T)` bits into the mask, starting from _pos_.  `T` must be an integral type of a SYCL vector of integral types.  _pos_ must be a multiple of `CHAR_BIT * sizeof(T)` in the range [0, `size()`).  If _pos_ + `CHAR_BIT * sizeof(T)` is greater than `size()`, the final `size()` - (_pos_ + `CHAR_BIT * sizeof(T)`) bits are ignored.
+
+|+template <typename T = vec<uint32_t,4>> T extract_bits(id<1> pos = 0) const+
+|Return `CHAR_BIT * sizeof(T)` bits from the mask, starting from _pos_.  `T` must be an integral type or a SYCL vector of integral types.  _pos_ must be a multiple of `CHAR_BIT * sizeof(T)` in the range [0, `size()`).  If _pos_ + `CHAR_BIT * sizeof(T)` is greater than `size()`, the final `size()` - (_pos_ + `CHAR_BIT * sizeof(T)`) bits of the return value are zero.
+
+|+void set()+
+|Set all bits in the mask to true.
+
+|+void set(id<1> id, bool value = true)+
+|Set the bit corresponding to the specified _id_ to the value specified by _value_.
+
+|+void reset()+
+|Reset all bits in the mask.
+
+|+void reset(id<1> id)+
+|Reset the bit corresponding to the specified _id_.
+
+|+void reset_low()+
+|Reset the bit for the lowest +id+ with a corresponding bit set in the mask.  Functionally equivalent to +reset(find_low())+.
+
+|+void reset_high()+
+|Reset the bit for the highest +id+ with a corresponding bit set in the mask.  Functionally equivalent to +reset(find_high())+.
+
+|+void flip()+
+|Toggle the values of all bits in the mask.
+
+|+void flip(id<1> id)+
+|Toggle the value of the bit corresponding to the specified _id_.
+
+|===
+
+==== Sample Header
+
+[source, c++]
+----
+namespace cl {
+namespace sycl {
+namespace intel {
+
+struct group_mask {
+
+  // enable reference to individual bit
+  struct reference {
+    reference& operator=(bool x);
+    reference& operator=(const reference& x);
+    bool operator~() const;
+    operator bool() const;
+    reference& flip();
+  };
+
+  bool operator[](id<1> id) const;
+  reference operator[](id<1> id) const;
+  bool test(id<1> id) const;
+  bool all() const;
+  bool any() const;
+  bool none() const;
+  uint32_t count() const;
+  uint32_t size() const;
+  id<1> find_low() const;
+  id<1> find_high() const;
+
+  template <typename T = vec<uint32_t,4>>
+  void insert_bits(T bits, id<1> pos = 0);
+
+  template <typename T = vec<uint32_t,4>>
+  T extract_bits(id<1> pos = 0);
+
+  void set();
+  void set(id<1> id, bool value = true);
+  void reset();
+  void reset(id<1> id);
+  void reset_low();
+  void reset_high();
+  void flip();
+  void flip(id<1> id);
+
+  bool operator==(mask rhs) const;
+  bool operator!=(mask rhs) const;
+
+  mask operator &=(mask rhs);
+  mask operator |=(mask rhs);
+  mask operator ^=(mask rhs);
+  mask operator ~() const;
+  mask operator <<=(mask rhs);
+  mask operator >>=(mask rhs);
+
+  mask operator &(mask rhs) const;
+  mask operator |(mask rhs) const;
+  mask operator ^(mask rhs) const;
+
+};
+
+} // intel
+} // sycl
+} // cl
+----
+
+== Issues
+
+None.
+
+//. asd
+//+
+//--
+//*RESOLUTION*: Not resolved.
+//--
+
+== Revision History
+
+[cols="5,15,15,70"]
+[grid="rows"]
+[options="header"]
+|========================================
+|Rev|Date|Author|Changes
+|1|2020-03-16|John Pennycook|*Initial public working draft*
+|========================================
+
+//************************************************************************
+//Other formatting suggestions:
+//
+//* Use *bold* text for host APIs, or [source] syntax highlighting.
+//* Use +mono+ text for device APIs, or [source] syntax highlighting.
+//* Use +mono+ text for extension names, types, or enum values.
+//* Use _italics_ for parameters.
+//************************************************************************
diff --git a/sycl/doc/extensions/SubGroup/README.md b/sycl/doc/extensions/SubGroup/README.md
new file mode 100644
index 0000000000000..b40d4aea4991c
--- /dev/null
+++ b/sycl/doc/extensions/SubGroup/README.md
@@ -0,0 +1,4 @@
+# SYCL_INTEL_sub_group
+
+A new `sub_group` class representing an implementation-defined grouping of work-items in a work-group.
+
diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
new file mode 100755
index 0000000000000..2f463ecfa1959
--- /dev/null
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -0,0 +1,251 @@
+= SYCL_INTEL_sub_group
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+
+:blank: pass:[ +]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes.  This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+== Introduction
+IMPORTANT: This specification is a draft.
+
+NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos.
+
+NOTE: This document is better viewed when rendered as html with asciidoctor.  GitHub does not render image icons.
+
+This document describes an extension which introduces a new +sub_group+ class representing an implementation-defined grouping of work-items in a work-group.
+
+== Name Strings
+
++SYCL_INTEL_sub_group+
+
+== Notice
+
+Copyright (c) 2020 Intel Corporation.  All rights reserved.
+
+== Status
+
+Working Draft
+
+This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension.
+
+Because the interfaces defined by this specification are not final and are subject to change they are not intended to be used by shipping software products.
+
+== Version
+
+Built On: {docdate} +
+Revision: 1
+
+== Contact
+John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
+
+== Dependencies
+
+This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6.
+
+== Overview
+
+A sub-group represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.
+
+The first version of this document is focused on exposing sub-group functionality to the NDRange form of SYCL +parallel_for+, and does not address hierarchical parallelism.
+
+=== Towards a Generic Group Abstraction
+
+Providing a generic group abstraction encapsulating the shared functionality of all synchronizable SYCL groups (i.e. work-groups and sub-groups) in a single interface would enable users to write more general code and simplify the introduction of additional SYCL groups in the future (e.g. device-wide synchronization groups).  Some names in this proposal are chosen to demonstrate how this may look:
+
+- The common interface members of +sub_group+ do not reference sub-groups by name, opting instead for generic names like +get_group_range()+.
+- +get_enqueued_num_sub_groups()+ is exposed as +get_uniform_group_range()+ since future generic groups may not be 'enqueued' but may still be non-uniform.
+- +sub_group+ defines a number of types and static members to simplify writing generic code.
+
+=== Attributes
+
+The +[[intel::reqd_sub_group_size(n)]]+ attribute indicates that the kernel must be compiled and executed with a sub-group of size _n_.  The value of _n_ must be set to a sub-group size that is both supported by the device and compatible with all language features used by the kernel, or device compilation will fail.  The set of valid sub-group sizes can be queried as described below.
+
+In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object, as in the example below:
+
+[source, c++]
+----
+class Functor
+{
+  void operator()(item<1> item) [[intel::reqd_sub_group_size(16)]]
+  {
+    /* kernel code */
+  }
+}
+----
+
+It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case.
+
+=== Sub-group Queries
+
+Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled.  Developers can query these behaviors at a device level and for individual kernels.
+
+To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
+
+|===
+|Device Descriptors|Return Type|Description
+
+|+info::device::max_num_sub_groups+
+|+uint32_t+
+|Returns the maximum number of sub-groups in a work-group for any kernel executed on the device.  The minimum value is 1.
+
+|+info::device::sub_group_independent_forward_progress+
+|+bool+
+|Returns +true+ if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group.
+
+|+info::device::sub_group_sizes+
+|+vector_class<size_t>+
+|Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device.
+|===
+
+|===
+|Member Functions|Description
+
+|+template <info::kernel_sub_group param>typename info::param_traits<info::kernel_sub_group, param>::return_type get_sub_group_info(const device &dev) const+
+|Query information from the sub-group from a kernel using the +info::kernel_sub_group+ descriptor for a specific device.
+|===
+
+|===
+|Kernel Descriptors|Return Type|Description
+
+|+info::kernel_sub_group::max_num_sub_groups+
+|+uint32_t+
+|Returns the maximum number of sub-groups for this kernel.
+
+|+info::kernel_sub_group::compile_num_sub_groups+
+|+uint32_t+
+|Returns the number of sub-groups specified by the kernel, or 0 (if not specified).
+
+|+info::kernel_sub_group::max_sub_group_size+
+|+uint32_t+
+|Returns the maximum sub-group size for this kernel.
+
+|+info::kernel_sub_group::compile_sub_group_size+
+|+uint32_t+
+|Returns the required sub-group size specified by the kernel, or 0 (if not specified).
+|===
+
+=== The sub_group Class
+
+The +sub_group+ class encapsulates all functionality required to represent a particular sub-group within a parallel execution.  It has common by-value semantics and is not default or user-constructible, and can only be accessed via methods in the +nd_item+ class.
+
+|===
+|Member Functions|Description
+
+|+sub_group get_sub_group() const+
+|Return the sub-group to which the work-item belongs.
+|===
+
+The member functions of the sub-group class provide a mechanism for a developer to query properties of a sub-group and a work-item's position in it.
+
+|===
+|Member Functions|Description
+
+|+id<1> get_local_id() const+
+|Return an +id+ representing the index of the work-item within the sub-group.
+
+|+uint32_t get_linear_local_id() const+
+|Return a +uint32_t+ representing the index of the work-item within the sub-group.
+
+|+range<1> get_local_range() const+
+|Return a SYCL +range+ representing the number of work-items in the sub-group.
+
+|+range<1> get_max_local_range() const+
+|Return a SYCL +range+ representing the maximum number of work-items in any sub-group within the nd-range.
+
+|+id<1> get_group_id() const+
+|Return an +id+ representing the index of the sub-group within the work-group.
+
+|+uint32_t get_linear_group_id() const+
+|Return a +uint32_t+ representing the index of the sub-group within the work-group.
+
+|+range<1> get_group_range() const+
+|Return a SYCL +range+ representing the number of sub-groups within the work-group.
+
+|+range<1> get_uniform_group_range() const+
+|Return a SYCL +range+ representing the number of sub-groups per work-group in the uniform region of the nd-range.
+|===
+
+An example usage of the +sub_group+ class is given below:
+
+[source, c++]
+----
+parallel_for<class kernel>(..., [&](nd_item item)
+{
+  sub_group sg = item.get_sub_group();
+  for (int v = sg.get_local_id(); v < N; v += sg.get_local_range())
+  {
+    ...
+  }
+});
+----
+
+==== Sample Header
+
+[source, c++]
+----
+namespace cl {
+namespace sycl {
+namespace intel {
+struct sub_group {
+
+  using id_type = id<1>;
+  using range_type = range<1>;
+  using linear_id_type = uint32_t;
+  static constexpr int32_t dimensions = 1;
+
+  id_type get_local_id() const;
+  linear_id_type get_local_linear_id() const;
+  range_type get_local_range() const;
+  range_type get_max_local_range() const;
+
+  id_type get_group_id() const;
+  linear_id_type get_group_linear_id() const;
+  range_type get_group_range() const;
+  range_type get_uniform_group_range() const;
+
+};
+} // intel
+} // sycl
+} // cl
+----
+
+== Issues
+
+None.
+
+//. asd
+//+
+//--
+//*RESOLUTION*: Not resolved.
+//--
+
+== Revision History
+
+[cols="5,15,15,70"]
+[grid="rows"]
+[options="header"]
+|========================================
+|Rev|Date|Author|Changes
+|1|2019-04-19|John Pennycook|*Initial public working draft*
+|2|2020-03-16|John Pennycook|*Separate class definition from algorithms*
+|========================================
+
+//************************************************************************
+//Other formatting suggestions:
+//
+//* Use *bold* text for host APIs, or [source] syntax highlighting.
+//* Use +mono+ text for device APIs, or [source] syntax highlighting.
+//* Use +mono+ text for extension names, types, or enum values.
+//* Use _italics_ for parameters.
+//************************************************************************
diff --git a/sycl/doc/extensions/SubGroupAlgorithms/README.md b/sycl/doc/extensions/SubGroupAlgorithms/README.md
new file mode 100644
index 0000000000000..12ee30b3ba1b3
--- /dev/null
+++ b/sycl/doc/extensions/SubGroupAlgorithms/README.md
@@ -0,0 +1,3 @@
+# SYCL_INTEL_sub_group_algorithms
+
+A library of sub-group functions, including communication patterns such as permutes and shifts.
diff --git a/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
new file mode 100755
index 0000000000000..ab998730bc597
--- /dev/null
+++ b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
@@ -0,0 +1,177 @@
+= SYCL_INTEL_sub_group_algorithms
+:source-highlighter: coderay
+:coderay-linenums-mode: table
+
+// This section needs to be after the document title.
+:doctype: book
+:toc2:
+:toc: left
+:encoding: utf-8
+:lang: en
+
+:blank: pass:[ +]
+
+// Set the default source code type in this document to C++,
+// for syntax highlighting purposes.  This is needed because
+// docbook uses c++ and html5 uses cpp.
+:language: {basebackend@docbook:c++:cpp}
+
+== Introduction
+IMPORTANT: This specification is a draft.
+
+NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos.
+
+NOTE: This document is better viewed when rendered as html with asciidoctor.  GitHub does not render image icons.
+
+This document describes an extension which introduces a library of sub-group functions, including communication patterns such as permutes and shifts.
+
+== Name Strings
+
++SYCL_INTEL_sub_group_algorithms+
+
+== Notice
+
+Copyright (c) 2020 Intel Corporation.  All rights reserved.
+
+== Status
+
+Working Draft
+
+This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension.
+
+Because the interfaces defined by this specification are not final and are subject to change they are not intended to be used by shipping software products.
+
+== Version
+
+Built On: {docdate} +
+Revision: 1
+
+== Contact
+John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
+
+== Dependencies
+
+This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6 and the following extensions:
+
+- +SYCL_INTEL_group_algorithms+
+- +SYCL_INTEL_sub_group+
+
+== Overview
+
+This extension introduces a library of sub-group functions, providing a simple way for developers to apply common parallel algorithms across data held by different work-items in the same sub-group, or to use the work-items of a sub-group to apply common parallel algorithms across data stored in memory.
+
+This extension adds sub-group support to all of the functions from +SYCL_INTEL_group_algorithms+:
+
+- +any_of+
+- +all_of+
+- +none_of+
+- +broadcast+
+- +leader+
+- +reduce+
+- +exclusive_scan+
+- +inclusive_scan+
+
+It additionally introduces a number of functions that are currently specific to sub-groups:
+
+- +permute+
+- +shift_left+
+- +shift_right+
+- +xor_permute+
+- +load+
+- +store+
+
+=== Data Types
+
+All functions are supported for the fundamental scalar types supported by SYCL and instances of the SYCL +vec+ class. The fundamental scalar types (as defined in Section 6.5 of the SYCL 1.2.1 specification) are: +bool+, +char+, +signed char+, +unsigned char+, +short int+, +unsigned short int+, +int+, +unsigned int+, +long int+, +unsigned long int+, +long long int+, +unsigned long long int+, +size_t+, +float+, +double+, +half+.
+
+Functions with arguments of type +vec<T,N>+ are applied component-wise: they are semantically equivalent to N calls to a scalar function of type +T+.
+
+=== Functions
+
+The sub-group algorithms library is based on the algorithms library described in Section 28 of the {cpp}17 standard.  The syntax and restrictions are aligned, with two notable differences: the first argument to each function is a sub-group of work-items, in place of an execution policy; and pointers are accepted in place of iterators in order to guarantee that address space information is visible to the compiler.
+
+Sub-group algorithms are performed collaboratively by the work-items in a sub-group. All functions therefore act as synchronization points and must be encountered in converged control flow by all work-items in the sub-group -- if one work-item in the sub-group reaches the function, then all work-items in the sub-group must reach the function.  Additionally, restrictions may be placed on the arguments passed to each function in order to ensure that all work-items in the sub-group agree on the operation that is being performed.  Any such restrictions on the arguments passed to a function are defined within the descriptions of those functions.
+
+Many functions provide at least two overloads: one operating directly on data produced by the work-items in the sub-group, and another operating on a range of data in memory specified by a pair of pointers.  If the pointers passed to such a sub-group function are not the same for all work-items in the sub-group, their behavior is undefined.  How the elements of a range are processed by the work-items in a sub-group is undefined.
+
+It is undefined behavior for any of these functions to be invoked within a +parallel_for_work_group+ or +parallel_for_work_item+ context, but this restriction may be lifted in a future version of the proposal.
+
+==== Votes and Collectives
+
+Each of these functions from +SYCL_INTEL_group_library+ is supported for sub-groups.  The definitions are identical, except an instance of the +sub_group+ class is accepted as the first argument in place of an instance of the +group+ class.
+
+==== Permutes and Shifts
+
+The permute sub-group functions perform arbitrary communication between pairs of work-items in a sub-group.  Common patterns -- such as shifting all values in a sub-group by a fixed number of work-items, or reversing the order of all values in a sub-group -- are exposed as specialized functions that may be accelerated in hardware.
+
+|===
+|Function|Description
+
+|+template <typename T> T permute(sub_group sg, T x, id<1> local_id)+
+|Exchange values of _x_ between work-items in the sub-group in an arbitrary pattern.  Returns the value of _x_ from the work-item with the specified id.  The value of _local_id_ must be between 0 and the sub-group size.
+
+|+template <typename T> T shift_left(sub_group sg, T x, uint32_t delta)+
+|Exchange values of _x_ between work-items in the sub-group via a shift.  Returns the value of _x_ from the work-item whose id is _delta_ larger than the calling work-item. The value returned when the result of id + _delta_ is greater than or equal to the sub-group size is undefined.  The value of _delta_ must be the same for all work-items in the sub-group.
+
+|+template <typename T> T shift_right(sub_group sg, T x, uint32_t delta)+
+|Exchange values of _x_ between work-items in the sub-group via a shift.  Returns the value of _x_ from the work-item whose id is _delta_ smaller than the calling work-item.  The value of returned when the result of id - _delta_ is less than zero is undefined.  The value of _delta_ must be the same for all work-items in the sub-group.
+
+|+template <typename T> T xor_permute(sub_group sg, T x, id<1> mask)+
+|Exchange pairs of values of _x_ between work-items in the sub-group.  Returns the value of _x_ from the work-item whose id is equal to the exclusive-or of the calling work-item's id and _mask_. _mask_ must be a compile-time constant value that is the same for all work-items in the sub-group.
+
+|+template <typename T> T reverse(sub_group sg, T x)+
+|Exchange values of _x_ between work-items in the sub-group so as to reverse their order.  The value returned on work-item +i+ is the value of _x_ from the work-item whose id is equal to the sub-group size - +i+.
+
+|+template <typename T, class Compare> T sort(sub_group sg, T x, Compare comp)+
+|Exchange values of _x_ between work-items in the sub-group so as to reflect their ordering by the binary comparison function object _comp_.  _comp_ must be one of the comparison function objects from the group library.
+
+|===
+
+==== Loads and Stores
+
+The load and store sub-group functions enable developers to assert that all work-items in a sub-group read/write from/to contiguous locations in memory.  Such operations can be mapped directly to SIMD operations (when sub-groups are executed in SIMD fashion).
+
+|===
+|Function|Description
+
+|+template <typename T, access::address_space Space> T load(sub_group sg, const multi_ptr<T,Space> src)+
+|Load contiguous data from _src_.  Returns one element per work-item, corresponding to the memory location at _src_ + +get_local_id()+. The value of _src_ must be the same for all work-items in the sub-group.
+
+|+template <int N, typename T, access::address_space Space> vec<T,N> load(sub_group sg, const multi_ptr<T,Space> src)+
+|Load contiguous data from _src_.  Returns _N_ elements per work-item, corresponding to the _N_ memory locations at _src_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_. The value of _src_ must be the same for all work-items in the sub-group.
+
+|+template <typename T, access::address_space Space> void store(sub_group sg, multi_ptr<T,Space> dst, const T& x)+
+|Store contiguous data to _dst_.  The value of _x_ from each work-item is written to the memory location at _dst_ + +get_local_id()+. The value of _dst_ must be the same for all work-items in the sub-group.
+
+|+template <int N, typename T, access::address_space Space> void store(sub_group sg, multi_ptr<T,Space> dst, const vec<T,N>& x)+
+|Store contiguous data to _dst_.  The _N_ elements from each work-item are written to the memory locations at _dst_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_.  The value of _dst_ must be the same for all work-items in the sub-group. |
+|===
+
+== Issues
+
+None.
+
+//. asd
+//+
+//--
+//*RESOLUTION*: Not resolved.
+//--
+
+== Revision History
+
+[cols="5,15,15,70"]
+[grid="rows"]
+[options="header"]
+|========================================
+|Rev|Date|Author|Changes
+|1|2020-03-16|John Pennycook|*Initial public working draft*
+|========================================
+
+//************************************************************************
+//Other formatting suggestions:
+//
+//* Use *bold* text for host APIs, or [source] syntax highlighting.
+//* Use +mono+ text for device APIs, or [source] syntax highlighting.
+//* Use +mono+ text for extension names, types, or enum values.
+//* Use _italics_ for parameters.
+//************************************************************************
diff --git a/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md b/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
deleted file mode 100644
index ce87622a96f97..0000000000000
--- a/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# SYCL(TM) Proposals: Sub-groups for NDRange Parallelism
-
-**IMPORTANT**: This specification is a draft.
-
-**NOTE**: Khronos(R) is a registered trademark and SYCL(TM) is a trademark of the Khronos Group, Inc.
-
-A _sub-group_ represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.
-
-Sub-groups have been part of the OpenCL execution model since OpenCL 2.0, but many important functions are missing: several hardware features are exposed only as vendor-specific extensions, and functions common in other programming models are not exposed at all.  This proposal defines SYCL syntax and semantics for the core OpenCL functionality, but also seeks to expose some of these missing functions.
-
-The first version of this document is focused on exposing sub-group functionality to the NDRange form of SYCL `parallel_for`, and does not address hierarchical parallelism.
-
-## Alignment with OpenCL vs C++
-
-Where a feature is common to both OpenCL and C++, this proposal opts for C++-like naming:
-- Collective operators are named as in `<functional>` (e.g. `plus` instead of `sum`) and to avoid clashes with names in `<algorithm>` (e.g. `minimum` instead of `min`).
-- Scan operations are named as in `<algorithm>` (e.g. `inclusive_scan` instead of `scan_inclusive`).
-
-## Towards a Generic Group Abstraction
-
-Providing a generic group abstraction encapsulating the shared functionality of all synchronizable SYCL groups (i.e. work-groups and sub-groups) in a single interface would enable users to write more general code and simplify the introduction of additional SYCL groups in the future (e.g. device-wide synchronization groups).  Some names in this proposal are chosen to demonstrate how this may look:
-- The common interface members of `sub_group` do not reference sub-groups by name, opting instead for generic names like `get_group_range()`.
-- `get_enqueued_num_sub_groups()` is exposed as `get_uniform_group_range()`, since future generic groups may not be 'enqueued' but may still be non-uniform.
-- `barrier()` is exposed as a member of the `sub_group` class rather than as a member of the `nd_item` class.
-
-## Data Types
-
-Many aspects of sub-group behavior are implementation-defined and/or device-specific.  In order to maximize the portability of code written to utilize the sub-group class, all functions are supported for the fundamental standard scalar types supported by SYCL (see Section 6.5 of the SYCL 1.2.1 specification): `bool`, `char`, `signed char`, `unsigned char`, `short int`, `unsigned short int`, `int`, `unsigned int`, `long int`, `unsigned long int`, `long long int`, `unsigned long long int`, `size_t`, `float`, `double`, `half`.
-
-# Attributes
-
-In keeping with Section 6.7 of the SYCL 1.2.1 specification, attributes are made available as a C++11 attribute specifier in the `cl` namespace, and the attributes of a kernel are the sum of all the kernel attributes of all device functions called.  Attribute names are prefixed with `intel` to denote that they are Intel extensions.
-
-## Required Sub-group Size
-
-The `[[cl::intel_reqd_sub_group_size(n)]]` attribute indicates that the kernel must be compiled and executed with a sub-group of size `n`.  The value of `n` must be set to a sub-group size supported by the device, or device compilation will fail.
-
-In addition to device functions, the required sub-group size attribute may also be specified in the definition of a named functor object and lambda functions, as in the examples below:
-
-```c++
-class Functor
-{
-    void operator()(item<1> item) [[cl::intel_reqd_sub_group_size(16)]]
-    {
-        /* kernel code */
-    }
-}
-
-kernel<class kernel_name>(
-[]() [[cl::intel_reqd_sub_group_size(n)]] {
-     /* kernel code */
-});
-```
-
-# Sub-group Queries
-
-Under the OpenCL execution model (see Section 3.2.2 of the OpenCL 2.2 specification), several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled.  Developers can query these behaviors at a device level and for individual kernels.
-
-To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
-
-|Device descriptors|Return type|Description|
-|------------------|-----------|-----------|
-| `info::device::max_num_sub_groups` | `cl_uint` | Returns the maximum number of sub-groups in a work-group for any kernel executed on the device.  The minimum value is 1. |
-| `info::device::sub_group_independent_forward_progress` | `bool` | Returns `true` if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group. |
-| `info::device::sub_group_sizes` | `vector_class<size_t>` | Returns a vector_class of `size_t` containing the set of sub-group sizes supported by the device. |
-
-|Member functions|Description|
-|----------------|-----------|
-| `template <info::kernel_sub_group param>typename info::param_traits<info::kernel_sub_group, param>::return_type get_sub_group_info(const device &dev) const` | Query information from the sub-group from a kernel using the `info::kernel_sub_group` descriptor for a specific device. |
-| `template <info::kernel_sub_group param>typename info::param_traits<info::kernel_sub_group, param>::return_type get_sub_group_info(const device &dev, typename info::param_traits<info::kernel_sub_group, param>::input_type value) const` | Query information from the sub-group from a kernel using the `info::kernel_sub_group` descriptor for a specific device and input parameter. The expected value of the input parameter depends on the information being queried. |
-
-|Kernel descriptors|Input type|Return type|Description|
-|------------------|----------|-----------|-----------|
-| `info::kernel_sub_group::max_sub_group_size_for_ndrange` | `range<D>` | `uint32_t` | Returns the maximum sub-group size for the specified work-group size. |
-| `info::kernel_sub_group::sub_group_count_for_ndrange` | `range<D>` | `uint32_t` | Returns the number of sub-groups for the specified work-group size. |
-| `info::kernel_sub_group::local_size_for_sub_group_count` | `size_t` | `range<D>` | Returns a work-group size that will contain the specified number of sub-groups. |
-| `info::kernel_sub_group::max_num_sub_groups` | N/A | `uint32_t` | Returns the maximum number of sub-groups for this kernel. |
-| `info::kernel_sub_group::compile_num_sub_groups` | N/A | `uint32_t` | Returns the number of sub-groups specified by the kernel, or 0 (if not specified). |
-| `info::kernel_sub_group::compile_sub_group_size` | N/A | `size_t` | Returns the required sub-group size specified by the kernel, or 0 (if not specified). |
-
-# Using Subgroups within NDRange Kernels
-
-The `sub_group` class encapsulates all functionality required to represent a particular sub-group within a parallel execution.  It is not user-constructable, and can only be accessed via the `nd_item` class.
-
-|Member functions|Description|
-|----------------|-----------|
-| `sub_group get_sub_group() const` | Return the sub-group to which the work-item belongs. |
-
-An example usage of the `sub_group` class is given below.
-
- ```c++
-parallel_for<class kernel>(..., [&](nd_item item)
-{
-  sub_group sg = item.get_sub_group();
-  for (int v = sg.get_local_id(); v < N; v += sg.get_local_range())
-  {
-    ...
-  }
-});
- ```
-
-# Sub-group Functions
-
-With the exception of the common interface members, all member functions of the `sub_group` class are _sub-group functions_.  Sub-group functions synchronize all work-items in a sub-group (i.e. they act as sub-group barriers) and must therefore be encountered within converged control flow across all work-items in the sub-group.  All the work-items of a sub-group must execute the sub-group function before any are allowed to continue execution beyond the sub-group function.
-
-Each sub-group function applies only to the work-items within a single sub-group; communication between multiple sub-groups requires the use of work-group functions, or reads/writes from/to memory with appropriate work-group barriers and/or memory fences.
-
-The sub-group functions in this proposal have been identified as a core set of functions that should ideally be supported by all implementations and have a clear mapping to all devices.  The vast majority of these functions have an equivalent in other specifications (e.g. OpenCL, SPIR), and the semantics defined here are intended to be compatible.  Additional, highly specialized, sub-group functions should be relegated to vendor- or device-specific extensions.
-
-## Core Functionality
-
-### Common Member Functions
-
-The common member functions provide a mechanism for a developer to query properties of a sub-group and a work-item's position in it.
-
-|Member functions|Description|
-|----------------|-----------|
-| `id<1> get_local_id() const` | Return an id representing the index of the work-item within the sub-group. |
-| `range<1> get_local_range() const` | Return a SYCL range representing the number of work-items in the sub-group. |
-| `range<1> get_max_local_range() const` | Return a SYCL range representing the maximum number of work-items in any sub-group within the nd-range. |
-| `id<1> get_group_id() const` | Return an id representing the index of the sub-group within the work-group. |
-| `uint32_t get_group_range() const` | Return the number of sub-groups within the work-group. |
-| `uint32_t get_uniform_group_range() const` | Return the number of sub-groups per work-group in the uniform region of the nd-range. |
-
-### Synchronization Functions
-
-A sub-group barrier synchronizes all work-items in a sub-group, and orders memory operations to the specified address space(s).  On hardware where sub-groups are executed in SIMD, a sub-group barrier is expected to be a no-op.
-
-|Member functions|Description|
-|----------------|-----------|
-| `void barrier(access::fence_space accessSpace = access::fence_space::global_and_local) const;` | Execute a sub-group barrier with an optional memory fence specified by `accessSpace`. |
-
-### Vote / Ballot
-
-The vote / ballot sub-group functions communicate Boolean conditions between the work-items in a sub-group, and enable developers to direct control flow at the sub-group level: a work-item may take a branch if _any_ work-item in its sub-group would do so; or may exit a loop only once _all_ work-items in its sub-group have finished.
-
-|Member functions|Description|
-|----------------|-----------|
-| `bool any(bool predicate) const` | Return `true` if `predicate` evaluates to `true` for any work-item in the sub-group. |
-| `bool all(bool predicate) const` | Return `true` if `predicate` evaluates to `true` for all work-items in the sub-group. |
-
-### Collectives
-
-The collective sub-group functions perform communications that involve all work-items in a sub-group, providing several common communication patterns: sharing a single value across the sub-group via a _broadcast_; combining all values from the sub-group into one value via a _reduction_; or performing a _scan_ across all values in the sub-group.
-
-The `plus`, `minimum` and `maximum` functors in the `cl::sycl` namespace correspond to the collective operations supported by OpenCL 2.0.  Supporting other operations (e.g. `minus` and `multiplies`) and user-defined functors may be of interest in the future.
-
-|Member functions|Description|
-|----------------|-----------|
-| `template <typename T>T broadcast(T x, id<1> local_id) const` | Broadcast the value of `x` from the work-item with the specified id to all work-items within the sub-group. The value of `local_id` must be the same for all work-items in the sub-group. |
-| `template <typename T, class BinaryOp>T reduce(T x, BinaryOp binary_op) const` | Combine the values of `x` from all work-items in the sub-group using the specified operator, which must be one of: `plus`, `minimum` or `maximum`. |
-| `template <typename T, class BinaryOp>T reduce(T x, T init, BinaryOp binary_op) const` | Combine the values of `x` from all work-items in the sub-group using an initial value of `init` and the specified operator, which must be one of: `plus`, `minimum` or `maximum`. |
-| `template <typename T, class BinaryOp>T exclusive_scan(T x, BinaryOp binary_op) const` | Perform an exclusive scan over the values of `x` from all work-items in the sub-group using the specified operator, which must be one of: `plus`, `minimum` or `maximum`. The value returned on work-item `i` is the exclusive scan of the first `i` work-items in the sub-group. The initial value is the identity value of the operator. |
-| `template <typename T, class BinaryOp>T exclusive_scan(T x, T init, BinaryOp binary_op) const` | Perform an exclusive scan over the values of `x` from all work-items in the sub-group using the specified operator, which must be one of: `plus`, `minimum` or `maximum`. The value returned on work-item `i` is the exclusive scan of the first `i` work-items in the sub-group. The initial value is specified by `init`. |
-| `template <typename T, class BinaryOp>T inclusive_scan(T x, BinaryOp binary_op) const` | Perform an inclusive scan over the values of `x` from all work-items in the sub-group using the specified operator, which must be one of: `plus`, `minimum` or `maximum`. The value returned on work-item `i` is the inclusive scan of the first `i` work-items in the sub-group. |
-| `template <typename T, class BinaryOp>T inclusive_scan(T x, BinaryOp binary_op, T init) const` | Perform an inclusive scan over the values of `x` from all work-items in the sub-group using the specified operator, which must be one of: `plus`, `minimum` or `maximum`. The value returned on work-item `i` is the inclusive scan of the initial value `init` and the first `i` work-items in the sub-group. |
-
-## Extended Functionality
-
-### Shuffles
-
-The shuffle sub-group functions perform arbitrary communication between pairs of work-items in a sub-group.  Common patterns -- such as shifting all values in a sub-group by a fixed number of work-items -- are exposed as specialized shuffles that may be accelerated in hardware.
-
-|Member functions|Description|
-|----------------|-----------|
-| `template <typename T>T shuffle(T x, id<1> local_id) const` | Exchange values of `x` between work-items in the sub-group in an arbitrary pattern.  Returns the value of `x` from the work-item with the specified id.  The value of `local_id` must be between 0 and the sub-group size. |
-| `template <typename T>T shuffle_down(T x, uint32_t delta) const` | Exchange values of `x` between work-items in the sub-group via a shift.  Returns the value of `x` from the work-item whose id is `delta` larger than the calling work-item. The value returned when the result of id + `delta` is greater than or equal to the sub-group size is undefined. |
-| `template <typename T>T shuffle_up(T x, uint32_t delta) const` | Exchange values of `x` between work-items in the sub-group via a shift.  Returns the value of `x` from the work-item whose id is `delta` smaller than the calling work-item.  The value of returned when the result of id - `delta` is less than zero is undefined. |
-| `template <typename T>T shuffle_xor(T x, id<1> mask) const` | Exchange pairs of values of `x` between work-items in the sub-group.  Returns the value of `x` from the work-item whose id is equal to the exclusive-or of the calling work-item's id and `mask`. `mask` must be a compile-time constant value that is the same for all work-items in the sub-group. |
-
-### Two-Input Shuffles
-
-This proposal makes a distinction between shuffles with one input per work-item and shuffles with two inputs per work-item.  The two-input versions map naturally to SIMD execution (see the `shuffle2` vector operation from OpenCL), and enable developers to avoid certain undefined behaviors from the one-input versions.  The simplest way to think of the two-input shuffles is that their operation is equivalent to a one-input shuffle on a virtual sub-group twice as big.
-
-|Member functions|Description|
-|----------------|-----------|
-| `template <typename T>T shuffle(T x, T y, id<1> local_id) const` | Exchange values of `x` and `y` between work-items in the sub-group in an arbitrary pattern.  If `local_id` is between 0 and the sub-group size, returns the value of `x` from the work-item with the specified id; if `local_id` is between the sub-group size and twice the sub-group size, returns the value of `y` from the work-item with the specified id (modulo the sub-group size).  The value of `local_id` must be between 0 and twice the sub-group size. |
-| `template <typename T>T shuffle_down(T x, T y, uint32_t delta) const` | Exchange values of `x` and `y` between work-items in the sub-group via a shift.  If the calling work-item's id + `delta` is between 0 and the sub-group size, returns the value of `x` from the work-item whose id is `delta` larger than the calling work-item; if the calling work-item's id + `delta` is between the sub-group size and twice the sub-group size, returns the value of `y` from the work-item with the specified id (modulo the sub-group size).  The value of `delta` must be less than the sub-group size. |
-| `template <typename T>T shuffle_up(T x, T y, uint32_t delta) const` | Exchange values of `x` and `y` between work-items in the sub-group via a shift.  If the calling work-item's id - `delta` is between 0 and the sub-group size, returns the value of `x` from the work-item whose id is `delta` smaller than the calling work-item; if the calling work-item's id - `delta` is between the sub-group size and twice the sub-group size, returns the value of `y` from the work-item with the specified id (modulo the sub-group size).  The value of `delta` must be less than the sub-group size. |
-
-### Loads / Stores
-
-The load and store sub-group functions enable developers to assert that all work-items in a sub-group read/write from/to contiguous locations in memory.  Such operations can be mapped directly to SIMD operations.
-
-|Member functions|Description|
-|----------------|-----------|
-| `template <typename T, access::address_space Space>T load(const multi_ptr<T,Space> src) const` | Load contiguous data from `src`.  Returns one element per work-item, corresponding to the memory location at `src` + `get_local_id()`. The value of `src` must be the same for all work-items in the sub-group. |
-| `template <int N, typename T, access::address_space Space>vec<T,N> load(const multi_ptr<T,Space> src) const` | Load contiguous data from `src`.  Returns `N` elements per work-item, corresponding to the `N` memory locations at `src` + `i` * `get_max_local_range()` + `get_local_id()` for `i` between 0 and `N`. The value of `src` must be the same for all work-items in the sub-group. |
-| `template <typename T, access::address_space Space>void store(multi_ptr<T,Space> dst, const T& x) const` | Store contiguous data to `dst`.  The value of `x` from each work-item is written to the memory location at `dst` + `get_local_id()`. The value of `dst` must be the same for all work-items in the sub-group. |
-| `template <int N, typename T, access::address_space Space>void store(multi_ptr<T,Space> dst, const vec<T,N>& x) const` | Store contiguous data to `dst`.  The `N` elements from each work-item are written to the memory locations at `dst` + `i` * `get_max_local_range()` + `get_local_id()` for `i` between 0 and `N`.  The value of `dst` must be the same for all work-items in the sub-group. |
-
-# Sample Header
-
-```c++
-namespace cl {
-namespace sycl {
-namespace intel {
-struct sub_group {
-
-    /* --- common interface members --- */
-
-    id<1> get_local_id() const;
-
-    range<1> get_local_range() const;
-
-    range<1> get_max_local_range() const;
-
-    id<1> get_group_id() const;
-
-    uint32_t get_group_range() const;
-
-    uint32_t get_uniform_group_range() const;
-
-    /* --- vote/ballot functions --- */
-
-    bool any(bool predicate) const;
-
-    bool all(bool predicate) const;
-
-    /* --- data-sharing --- */
-
-    template <typename T>
-    T broadcast(T x, id<1> local_id) const;
-
-    template <typename T, class BinaryOp>
-    T reduce(T x, BinaryOp binary_op) const;
-
-    template <typename T, class BinaryOp>
-    T reduce(T x, T init, BinaryOp binary_op) const;
-
-    template <typename T, class BinaryOp>
-    T exclusive_scan(T x, BinaryOp binary_op) const;
-
-    template <typename T, class BinaryOp>
-    T exclusive_scan(T x, T init, BinaryOp binary_op) const;
-
-    template <typename T, class BinaryOp>
-    T inclusive_scan(T x, BinaryOp binary_op) const;
-
-    template <typename T, class BinaryOp>
-    T inclusive_scan(T x, BinaryOp binary_op, T init) const;
-
-    /* --- one-input shuffles --- */
-
-    template <typename T>
-    T shuffle(T x, id<1> local_id) const;
-
-    template <typename T>
-    T shuffle_down(T x, uint32_t delta) const;
-
-    template <typename T>
-    T shuffle_up(T x, uint32_t delta) const;
-
-    template <typename T>
-    T shuffle_xor(T x, id<1> value) const;
-
-    /* --- two-input shuffles --- */
-
-    template <typename T>
-    T shuffle(T x, T y, id<1> local_id) const;
-
-    template <typename T>
-    T shuffle_down(T current, T next, uint32_t delta) const;
-
-    template <typename T>
-    T shuffle_up(T previous, T current, uint32_t delta) const;
-
-    /* --- sub-group load/stores --- */
-
-    template <typename T, access::address_space Space>
-    T load(const multi_ptr<T,Space> src) const;
-
-    template <typename T, int N, access::address_space Space>
-    vec<T,N> load(const multi_ptr<T,Space> src) const;
-
-    template <typename T, int N, access::address_space Space>
-    void store(multi_ptr<T,Space> dst, const T& x) const;
-
-    template <typename T, int N, access::address_space Space>
-    void store(multi_ptr<T,Space> dst, const vec<T,N>& x) const;
-
-};
-} // intel
-} // sycl
-} // cl
-```

From b4e84cf0d17b359630b8fe4294a38f9618bbc7ba Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 09:45:35 -0700
Subject: [PATCH 02/14] [SYCL][Doc] Revision v1.2.1-6 to Revision 6

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc    | 2 +-
 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc      | 2 +-
 .../SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
index 0a1cc753ae60c..9b5d2784c53db 100755
--- a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -51,7 +51,7 @@ John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
 
 == Dependencies
 
-This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6 and the following extensions:
+This extension is written against the SYCL 1.2.1 specification, Revision 6 and the following extensions:
 
 - +SYCL_INTEL_sub_group+
 
diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index 2f463ecfa1959..4a3edc41d2791 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -51,7 +51,7 @@ John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
 
 == Dependencies
 
-This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6.
+This extension is written against the SYCL 1.2.1 specification, Revision 6.
 
 == Overview
 
diff --git a/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
index ab998730bc597..3865a08fb5731 100755
--- a/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
+++ b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
@@ -51,7 +51,7 @@ John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)
 
 == Dependencies
 
-This extension is written against the SYCL 1.2.1 specification, Revision v1.2.1-6 and the following extensions:
+This extension is written against the SYCL 1.2.1 specification, Revision 6 and the following extensions:
 
 - +SYCL_INTEL_group_algorithms+
 - +SYCL_INTEL_sub_group+

From 3abdfa58ec8d16b91b00b9bc5b52a005213eb560 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 09:46:48 -0700
Subject: [PATCH 03/14] [SYCL][Doc] Fix copyright year

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index 4a3edc41d2791..659050b1b8cf8 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -31,7 +31,7 @@ This document describes an extension which introduces a new +sub_group+ class re
 
 == Notice
 
-Copyright (c) 2020 Intel Corporation.  All rights reserved.
+Copyright (c) 2019-2020 Intel Corporation.  All rights reserved.
 
 == Status
 

From e7dc7c23e908a084f2a026cba2008cf072be86c2 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 11:10:01 -0700
Subject: [PATCH 04/14] [SYCL][Doc] Point from old extension to new

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md

diff --git a/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md b/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
new file mode 100644
index 0000000000000..0b6ac4304431e
--- /dev/null
+++ b/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
@@ -0,0 +1,5 @@
+The sub-group extension has been split into two extensions, which can be found below:
+- [SubGroup](../SubGroup/)
+- [SubGroupAlgorithms](../SubGroupAlgorithms/)
+
+Some existing sub-group functionality has been deprecated as part of this change, and will be removed in a future version of the compiler.

From 16ae2ffa164064f8c52d6de7ebea78722df8aa4e Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 11:14:08 -0700
Subject: [PATCH 05/14] [SYCL][Doc] Link to archived sub-group extension

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md b/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
index 0b6ac4304431e..ab73bb225653b 100644
--- a/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
+++ b/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md
@@ -2,4 +2,6 @@ The sub-group extension has been split into two extensions, which can be found b
 - [SubGroup](../SubGroup/)
 - [SubGroupAlgorithms](../SubGroupAlgorithms/)
 
+The previous extension is archived [here](https://github.com/intel/llvm/blob/fba2e0602550a86c74149d9875b788ad1117f8d3/sycl/doc/extensions/SubGroupNDRange/SubGroupNDRange.md).
+
 Some existing sub-group functionality has been deprecated as part of this change, and will be removed in a future version of the compiler.

From 807dc5bab13b193d33a7877b657805cf269b0578 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 11:41:56 -0700
Subject: [PATCH 06/14] [SYCL][Doc] Fix typo in GroupMask

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
index 9b5d2784c53db..cede3cfd90885 100755
--- a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -117,7 +117,7 @@ The mask is defined such that the least significant bit (LSB) corresponds to the
 |Return the highest +id+ with a corresponding bit set in the mask. If no bits are set, the return value is equal to `size()`.
 
 |+template <typename T = vec<uint32_t,4>> void insert_bits(T bits, id<1> pos = 0)+
-|Insert `CHAR_BIT * sizeof(T)` bits into the mask, starting from _pos_.  `T` must be an integral type of a SYCL vector of integral types.  _pos_ must be a multiple of `CHAR_BIT * sizeof(T)` in the range [0, `size()`).  If _pos_ + `CHAR_BIT * sizeof(T)` is greater than `size()`, the final `size()` - (_pos_ + `CHAR_BIT * sizeof(T)`) bits are ignored.
+|Insert `CHAR_BIT * sizeof(T)` bits into the mask, starting from _pos_.  `T` must be an integral type or a SYCL vector of integral types.  _pos_ must be a multiple of `CHAR_BIT * sizeof(T)` in the range [0, `size()`).  If _pos_ + `CHAR_BIT * sizeof(T)` is greater than `size()`, the final `size()` - (_pos_ + `CHAR_BIT * sizeof(T)`) bits are ignored.
 
 |+template <typename T = vec<uint32_t,4>> T extract_bits(id<1> pos = 0) const+
 |Return `CHAR_BIT * sizeof(T)` bits from the mask, starting from _pos_.  `T` must be an integral type or a SYCL vector of integral types.  _pos_ must be a multiple of `CHAR_BIT * sizeof(T)` in the range [0, `size()`).  If _pos_ + `CHAR_BIT * sizeof(T)` is greater than `size()`, the final `size()` - (_pos_ + `CHAR_BIT * sizeof(T)`) bits of the return value are zero.

From ff98d10da90954dca23bb110231692f9cc5b8294 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 11:43:01 -0700
Subject: [PATCH 07/14] [SYCL][Doc] Space template must be local or global

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 .../SYCL_INTEL_sub_group_algorithms.asciidoc              | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
index 3865a08fb5731..af7ed1771f393 100755
--- a/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
+++ b/sycl/doc/extensions/SubGroupAlgorithms/SYCL_INTEL_sub_group_algorithms.asciidoc
@@ -135,16 +135,16 @@ The load and store sub-group functions enable developers to assert that all work
 |Function|Description
 
 |+template <typename T, access::address_space Space> T load(sub_group sg, const multi_ptr<T,Space> src)+
-|Load contiguous data from _src_.  Returns one element per work-item, corresponding to the memory location at _src_ + +get_local_id()+. The value of _src_ must be the same for all work-items in the sub-group.
+|Load contiguous data from _src_.  Returns one element per work-item, corresponding to the memory location at _src_ + +get_local_id()+. The value of _src_ must be the same for all work-items in the sub-group.  _Space_ must be +access::address_space::global_space+ or +access::address_space::local_space+.
 
 |+template <int N, typename T, access::address_space Space> vec<T,N> load(sub_group sg, const multi_ptr<T,Space> src)+
-|Load contiguous data from _src_.  Returns _N_ elements per work-item, corresponding to the _N_ memory locations at _src_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_. The value of _src_ must be the same for all work-items in the sub-group.
+|Load contiguous data from _src_.  Returns _N_ elements per work-item, corresponding to the _N_ memory locations at _src_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_. The value of _src_ must be the same for all work-items in the sub-group.  _Space_ must be +access::address_space::global_space+ or +access::address_space::local_space+.
 
 |+template <typename T, access::address_space Space> void store(sub_group sg, multi_ptr<T,Space> dst, const T& x)+
-|Store contiguous data to _dst_.  The value of _x_ from each work-item is written to the memory location at _dst_ + +get_local_id()+. The value of _dst_ must be the same for all work-items in the sub-group.
+|Store contiguous data to _dst_.  The value of _x_ from each work-item is written to the memory location at _dst_ + +get_local_id()+. The value of _dst_ must be the same for all work-items in the sub-group.  _Space_ must be +access::address_space::global_space+ or +access::address_space::local_space+.
 
 |+template <int N, typename T, access::address_space Space> void store(sub_group sg, multi_ptr<T,Space> dst, const vec<T,N>& x)+
-|Store contiguous data to _dst_.  The _N_ elements from each work-item are written to the memory locations at _dst_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_.  The value of _dst_ must be the same for all work-items in the sub-group. |
+|Store contiguous data to _dst_.  The _N_ elements from each work-item are written to the memory locations at _dst_ + +i+ * +get_max_local_range()+ + +get_local_id()+ for +i+ between 0 and _N_.  The value of _dst_ must be the same for all work-items in the sub-group.  _Space_ must be +access::address_space::global_space+ or +access::address_space::local_space+.
 |===
 
 == Issues

From 039050afbd1db23961d3665d28573f922cf40980 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 11:57:35 -0700
Subject: [PATCH 08/14] [SYCL][Doc] Specify class before member functions

Text before tables now identifies which class is being discussed.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 .../extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc  | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index 659050b1b8cf8..53163127361fc 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -92,6 +92,8 @@ Several aspects of sub-group functionality are implementation-defined: the size
 
 To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
 
+The device descriptors below are added to the +info::device+ enumeration class:
+
 |===
 |Device Descriptors|Return Type|Description
 
@@ -108,6 +110,8 @@ To maximize portability across devices, developers should not assume that work-i
 |Returns a vector_class of +size_t+ containing the set of sub-group sizes supported by the device.
 |===
 
+An additional query for sub-group information is added to the +kernel+ class:
+
 |===
 |Member Functions|Description
 
@@ -115,6 +119,8 @@ To maximize portability across devices, developers should not assume that work-i
 |Query information from the sub-group from a kernel using the +info::kernel_sub_group+ descriptor for a specific device.
 |===
 
+The kernel descriptors below are added as part of a new +info::kernel_sub_group+ enumeration class:
+
 |===
 |Kernel Descriptors|Return Type|Description
 
@@ -137,7 +143,9 @@ To maximize portability across devices, developers should not assume that work-i
 
 === The sub_group Class
 
-The +sub_group+ class encapsulates all functionality required to represent a particular sub-group within a parallel execution.  It has common by-value semantics and is not default or user-constructible, and can only be accessed via methods in the +nd_item+ class.
+The +sub_group+ class encapsulates all functionality required to represent a particular sub-group within a parallel execution.  It has common by-value semantics and is not default or user-constructible, and can only be accessed in ND-range kernels.
+
+To provide access to the +sub_group+ class, a new member function is added to the +nd_item+ class:
 
 |===
 |Member Functions|Description

From fae0eab1e3a8445bed276c57bf5fa71977d52791 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 12:09:43 -0700
Subject: [PATCH 09/14] [SYCL][Doc] Clarify sub-group dimensions and size

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index 53163127361fc..3ff51291fb3a2 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -88,7 +88,9 @@ It is illegal for a kernel or function to call a function with a mismatched sub-
 
 === Sub-group Queries
 
-Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled.  Developers can query these behaviors at a device level and for individual kernels.
+Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled.  Developers can query these behaviors at a device level and for individual kernels.  The sub-group size for a given combination of kernel and launch configuration is fixed, and guaranteed to be reflected by device and kernel queries.
+
+Each sub-group in a work-group is one-dimensional.  If the total number of work-items in a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items.  If the total number of work-items in a work-group is not evenly divisible by the sub-group size, the number of work-items in the final sub-group is equal to the remainder of the total work-group size divided by the sub-group size.
 
 To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, nor that two sub-groups within a work-group will make independent forward progress with respect to one another.
 

From 58b2f7de9448c1425456d4971ddb588cc23ce961 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 12:52:15 -0700
Subject: [PATCH 10/14] [SYCL][Doc] Clarify attribute propagation behavior

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index 3ff51291fb3a2..ebb6fddf7d57b 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -84,7 +84,7 @@ class Functor
 }
 ----
 
-It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case.
+It is illegal for a kernel or function to call a function with a mismatched sub-group size requirement, and the compiler should produce an error in this case.  The +reqd_sub_group_size+ attribute is not propagated from a device function to callers of the function, and must be specified explicitly when a kernel is defined.
 
 === Sub-group Queries
 

From 99668d44ce8b5c79838d2de5ef9a3719566b656d Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 13:56:49 -0700
Subject: [PATCH 11/14] [SYCL][Doc] Remove get_uniform_group_range()

Unclear what this means if non-uniform work-groups are not supported.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
index ebb6fddf7d57b..546bed21a0187 100755
--- a/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
+++ b/sycl/doc/extensions/SubGroup/SYCL_INTEL_sub_group.asciidoc
@@ -64,7 +64,6 @@ The first version of this document is focused on exposing sub-group functionalit
 Providing a generic group abstraction encapsulating the shared functionality of all synchronizable SYCL groups (i.e. work-groups and sub-groups) in a single interface would enable users to write more general code and simplify the introduction of additional SYCL groups in the future (e.g. device-wide synchronization groups).  Some names in this proposal are chosen to demonstrate how this may look:
 
 - The common interface members of +sub_group+ do not reference sub-groups by name, opting instead for generic names like +get_group_range()+.
-- +get_enqueued_num_sub_groups()+ is exposed as +get_uniform_group_range()+ since future generic groups may not be 'enqueued' but may still be non-uniform.
 - +sub_group+ defines a number of types and static members to simplify writing generic code.
 
 === Attributes
@@ -182,8 +181,8 @@ The member functions of the sub-group class provide a mechanism for a developer
 |+range<1> get_group_range() const+
 |Return a SYCL +range+ representing the number of sub-groups within the work-group.
 
-|+range<1> get_uniform_group_range() const+
-|Return a SYCL +range+ representing the number of sub-groups per work-group in the uniform region of the nd-range.
+|+range<1> get_max_group_range() const+
+|Return a SYCL +range+ representing the maximum number of sub-groups per work-group within the nd-range.
 |===
 
 An example usage of the +sub_group+ class is given below:
@@ -222,7 +221,6 @@ struct sub_group {
   id_type get_group_id() const;
   linear_id_type get_group_linear_id() const;
   range_type get_group_range() const;
-  range_type get_uniform_group_range() const;
 
 };
 } // intel

From 0b0c4741057678661bf3c95748c96a91d65e1dce Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 14:21:02 -0700
Subject: [PATCH 12/14] [SYCL][Doc] Fix group_mask operator definitions

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 .../GroupMask/SYCL_INTEL_group_mask.asciidoc  | 67 ++++++++++++++++---
 1 file changed, 56 insertions(+), 11 deletions(-)

diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
index cede3cfd90885..535a7d5aba765 100755
--- a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -146,6 +146,48 @@ The mask is defined such that the least significant bit (LSB) corresponds to the
 |+void flip(id<1> id)+
 |Toggle the value of the bit corresponding to the specified _id_.
 
+|+bool operator==(group_mask rhs) const+
+|Return true if each bit in this mask is equal to the corresponding bit in +rhs+.
+
+|+bool operator!=(group_mask rhs) const+
+|Return true if any bit in this mask is not equal to the corresponding bit in +rhs+.
+
+|+group_mask operator &=(group_mask rhs)+
+|Set the bits of this mask to the result of performing a bitwise AND with this mask and +rhs+.
+
+|+group_mask operator |=(group_mask rhs)+
+|Set the bits of this mask to the result of performing a bitwise OR with this mask and +rhs+.
+
+|+group_mask operator ^=(group_mask rhs)+
+|Set the bits of this mask to the result of performing a bitwise XOR with this mask and +rhs+.
+
+|+group_mask operator <<=(size_t shift)+
+|Set the bits of this mask to the result of shifting its bits _shift_ positions to the left.
+
+|+group_mask operator >>=(size_t shift)+
+|Set the bits of this mask to the result of shifting its bits _shift_ positions to the right.
+
+|+group_mask operator ~() const+
+|Return a mask representing the result of flipping all the bits in this mask.
+
+|+group_mask operator <<(size_t shift)+
+|Return a mask representing the result of shifting its bits _shift_ positions to the left.
+
+|+group_mask operator >>(size_t shift)+
+|Return a mask representing the result of shifting its bits _shift_ positions to the right.
+
+|===
+|Function|Description
+
+|+group_mask operator &(const group_mask& lhs, const group_mask& rhs)+
+|Return a mask representing the result of performing a bitwise AND of +lhs+ and +rhs+.
+
+|+group_mask operator |(const group_mask& lhs, const group_mask& rhs)+
+|Return a mask representing the result of performing a bitwise OR of +lhs+ and +rhs+.
+
+|+group_mask operator ^(const group_mask& lhs, const group_mask& rhs)+
+|Return a mask representing the result of performing a bitwise XOR of +lhs+ and +rhs+.
+
 |===
 
 ==== Sample Header
@@ -193,22 +235,25 @@ struct group_mask {
   void flip();
   void flip(id<1> id);
 
-  bool operator==(mask rhs) const;
-  bool operator!=(mask rhs) const;
+  bool operator==(group_mask rhs) const;
+  bool operator!=(group_mask rhs) const;
 
-  mask operator &=(mask rhs);
-  mask operator |=(mask rhs);
-  mask operator ^=(mask rhs);
-  mask operator ~() const;
-  mask operator <<=(mask rhs);
-  mask operator >>=(mask rhs);
+  group_mask operator &=(group_mask rhs);
+  group_mask operator |=(group_mask rhs);
+  group_mask operator ^=(group_mask rhs);
+  group_mask operator <<=(size_t);
+  group_mask operator >>=(size_t rhs);
 
-  mask operator &(mask rhs) const;
-  mask operator |(mask rhs) const;
-  mask operator ^(mask rhs) const;
+  group_mask operator ~() const;
+  group_mask operator <<(size_t) const;
+  group_mask operator >>(size_t) const;
 
 };
 
+group_mask operator &(const group_mask& lhs, const group_mask& rhs);
+group_mask operator |(const group_mask& lhs, const group_mask& rhs);
+group_mask operator ^(const group_mask& lhs, const group_mask& rhs);
+
 } // intel
 } // sycl
 } // cl

From b0825b0266a1b47c1b454232b78db17b8e077ca0 Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Mon, 16 Mar 2020 14:24:03 -0700
Subject: [PATCH 13/14] [SYCL][Doc] Fix table formatting

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
index 535a7d5aba765..4627abab17e3c 100755
--- a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -175,6 +175,7 @@ The mask is defined such that the least significant bit (LSB) corresponds to the
 
 |+group_mask operator >>(size_t shift)+
 |Return a mask representing the result of shifting its bits _shift_ positions to the right.
+|===
 
 |===
 |Function|Description

From 11a73fbdb383ca5931a20cbeab394737935c148b Mon Sep 17 00:00:00 2001
From: John Pennycook <john.pennycook@intel.com>
Date: Tue, 17 Mar 2020 07:31:23 -0700
Subject: [PATCH 14/14] [SYCL][Doc] << and >> perform logical shifts

Signed-off-by: John Pennycook <john.pennycook@intel.com>
---
 .../extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc   | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
index 4627abab17e3c..8c20ad11ae37d 100755
--- a/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
+++ b/sycl/doc/extensions/GroupMask/SYCL_INTEL_group_mask.asciidoc
@@ -162,19 +162,19 @@ The mask is defined such that the least significant bit (LSB) corresponds to the
 |Set the bits of this mask to the result of performing a bitwise XOR with this mask and +rhs+.
 
 |+group_mask operator <<=(size_t shift)+
-|Set the bits of this mask to the result of shifting its bits _shift_ positions to the left.
+|Set the bits of this mask to the result of shifting its bits _shift_ positions to the left using a logical shift.  Bits that are shifted out to the left are discarded, and zeroes are shifted in from the right.
 
 |+group_mask operator >>=(size_t shift)+
-|Set the bits of this mask to the result of shifting its bits _shift_ positions to the right.
+|Set the bits of this mask to the result of shifting its bits _shift_ positions to the right using a logical shift.  Bits that are shifted out to the right are discarded, and zeroes are shifted in from the left.
 
 |+group_mask operator ~() const+
 |Return a mask representing the result of flipping all the bits in this mask.
 
 |+group_mask operator <<(size_t shift)+
-|Return a mask representing the result of shifting its bits _shift_ positions to the left.
+|Return a mask representing the result of shifting its bits _shift_ positions to the left using a logical shift.  Bits that are shifted out to the left are discarded, and zeroes are shifted in from the right.
 
 |+group_mask operator >>(size_t shift)+
-|Return a mask representing the result of shifting its bits _shift_ positions to the right.
+|Return a mask representing the result of shifting its bits _shift_ positions to the right using a logical shift.  Bits that are shifted out to the right are discarded, and zeroes are shifted in from the left.
 |===
 
 |===