# When HPC meets Big Data in the Cloud

Prof. Cho-Li Wang The University of Hong Kong









Dec. 17, 2013 @Cloud-Asia

# Big Data: The "4Vs" Model

- High Volume (amount of data)
- High Velocity (speed of data in and out)
- **High Variety** (range of data types and sources)
- High Values : Most Important



WHAT IS



will be online, pushing the data created and shared to nearly **8 zettabytes.** 

Worldwide IP traffic will quadruple by 2015.



Everyday business and consumer life creates 2.5 quintillion bytes of data per day.



90% of the data in the world today has been created in the last two years alone.



**2010:** 800,000 petabytes (would fill a stack of DVDs reaching from the **earth to the moon** and back)

**By 2020**, that pile of DVDs would stretch **half way to Mars**.



#### • McKinsey Global Institute (MGI) :

- Using big data, retailers could increase its operating margin by more than **60%**.
- The U.S. could reduce its healthcare expenditure by 8%
- Government administrators in Europe could save more than €100 billion (\$143 billion).

## **Google Trend: 12/2013** Big Data vs. Data Analytics vs. Cloud Computing



# "Big Data" in 2013

## Outline

- Part I: Multi-granularity Computation Migration
  - "A Computation Migration Approach to Elasticity of Cloud Computing" (previous work)
- Part II: Big Data Computing on Future Maycore Chips
  - *C*rocodiles: <u>C</u>loud <u>R</u>untime with <u>O</u>bject <u>C</u>oherence <u>O</u>n <u>D</u>ynamic tILES for future 1000-core tiled processors" (ongoing)



# Part I

# Multi-granularity Computation Migration

**Source:** Cho-Li Wang, King Tin Lam and Ricky Ma, **"A Computation Migration Approach to Elasticity of Cloud Computing"**, Network and Traffic Engineering in Emerging Distributed Computing Applications, IGI Global, pp. 145-178, July, 2012.

## **Multi-granularity Computation Migration**



## (1) WAVNet: Live VM Migration over WAN

- A P2P Cloud with live VM migration over WAN
  - "Virtualized LAN" over the Internet"
- High penetration via NAT hole punching
  - Establish direct host-to-host connection
  - Free from proxies, able to traverse most NATs





Zheming Xu, Sheng Di, Weida Zhang, Luwei Cheng, and Cho-Li Wang, WAVNet: Wide-Area Network Virtualization Technique for Virtual Private Cloud, 2011 International Conference on Parallel Processing (<u>ICPP2011</u>)

**Key Members** 

# WAVNet: Live VM Migration over WAN • Experiments at Pacific Rim Areas



# StoryTelling@Home on WAVNet

# Key functions: story upload, story search, and listening online (streaming/downloading)

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | Web<br>Application<br>Glassfish                               | Streaming<br>Server | Derby<br>Database |                     |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|---------------------|-------------------|---------------------|
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                                                               | VM (Xen)            |                   | Share Your<br>Story |
| StoryTelling@HOME<br>Philes first lagand<br>Californianee story: Additioner transles<br>List of recent stories                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | Ditcoin<br>ACCEPTED HERE                                      |                     |                   |                     |
| 1.30      Path      Pathtab time      Line Stream        20      Eve      192.102.123.12M0. Initiamental.Log2      Wedvendig Apr 13, 14 00.00      Stream        20      Eve      192.102.123.12M0. Initiamental.Log2      Wedvendig Apr 13, 14 00.00      Stream        21      Eve      192.102.123.12M0. Initiamental.Log2      Wedvendig Apr 13, 14 00.00      Stream        21      Eve      192.102.122.17M0.est Pathtab      Wedvendig Apr 13, 14 00.00      Stream        21      Eve      192.102.122.17M0.est Pathtab      Wedvendig Apr 13, 14 00.00      Stream        21      Eve      192.102.122.17M0.est Pathtab      Wedvendig Apr 13, 14 00.00      Stream        21      Eve      192.102.122.17M0.est Pathtab      Wedvendig Apr 13, 14 00.00      Stream | antia des contentada Content<br>activa des contentada Content |                     |                   |                     |
| 1.30      Path      Pathbol time      Use Street        still Aather      102 (108 123 170/01 hotsevental 3 rep3      Wedwendes Ayr 13, 04 00/01      Construction        20      Enc      102 (108 123 170/01 hotsevental 3 rep3      Wedwendes Ayr 13, 04 00/00      Construction        21      Enc      102 (108 123 170/01 hotsevental 3 rep3      Wedwendes Ayr 13, 04 00/00      Construction        21      Enc      102 (108 122 170/01 hotsevental 5 rep3      Wedwendes Ayr 13, 04 00/00      Construction                                                                                                                                                                                                                                                         | Constant<br>Constant                                          |                     |                   |                     |

Prototyped by Eric Wu, Queena Fung



## **History and Roadmap of JESSICA Project**

#### • JESSICA V1.0 (1996-1999)

- Execution mode: Interpreter Mode
- JVM kernel modification (Kaffe JVM)
- Global heap: built on top of TreadMarks (Lazy Release Consistency + homeless)

#### • JESSICA V2.0 (2000-2006)

- Execution mode: **JIT-Compiler Mode**
- JVM kernel modification
- Lazy release consistency + migrating-home protocol
- JESSICA V3.0 (2008~2010)
  - Built above JVM (via JVMTI)
  - Support Large Object Space
- JESSICA v.4 (2010~)
  - Japonica : Automatic loop parallization and speculative execution on GPU and multicore CPU. Handle dynamic loops with runtime dependency checking

#### **Download JESSICA2:**

http://i.cs.hku.hk/~clwang/projects/JESSICA2.html



#### **Past Members**





King Tin LAM,

Chenggang Zhang





Kinson Chan

Ricky Ma

#### (3) Stack Migration: "Stack-on-Demand" (SOD)





#### **SoD enabled the "Software Defined" Execution Model**



With such flexible or *composable* execution paths, SOD enables agile and elastic exploitation of distributed resources (storage) → Exploit Data Locality in Big Data Computing !

## **SOD : Face Detection on Cloud**



| apps       | capture time | transfer time | restore time | total migration |  |
|------------|--------------|---------------|--------------|-----------------|--|
|            | (ms)         | (ms)          | (ms)         | latency (ms)    |  |
| FaceDetect | 103          | 155           | 7            | 265             |  |



#### **SOD: "Mobile Spider" on iPhone**

| Bandwidth | Capture   | Transfer  | Restore   | Migration |
|-----------|-----------|-----------|-----------|-----------|
| (kbps)    | time (ms) | time (ms) | time (ms) | time (ms) |
| 50        | 14        | 1674      | 40        | 1729      |
| 128       | 13        | 1194      | 50        | 1040      |
| 384       | 14        | 728       | 29        | 772       |
| 764       | 14        | 672       | 31        | 717       |

Size of class file and state data = 8255 bytes

(with Wi-Fi connection)



#### A photo sharing **Cloud** service





Ricky K. K. Ma, King Tin Lam, Cho-Li Wang, "eXCloud: Transparent Runtime Support for Scaling Mobile Applications," 2011 IEEE International Conference on Cloud and Service Computing (CSC2011) (Best Paper Award)

#### **"JESSICA on Cloud": VM Migration + Thread Migration**



### **Comparison of Migration Overhead** Migration overhead (MO)

= execution time w/ migration – execution time w/o migration

| Sys |                  |         |      |                     | JESSICA2 on Xen<br>(Thread mig.) |                  | G-JavaMPI on Xen<br>(Process mig.) |         |                  | JDK on Xen<br>(VM live mig.) |         |      |
|-----|------------------|---------|------|---------------------|----------------------------------|------------------|------------------------------------|---------|------------------|------------------------------|---------|------|
|     | Exec. time (sec) |         | МО   | Exec. time (sec) MO |                                  | Exec. time (sec) |                                    | МО      | Exec. time (sec) |                              | МО      |      |
| Арр | w/ mig           | w/o mig | (ms) | w/ mig              | w/o mig                          | (ms)             | w/ mig                             | w/o mig | (ms)             | w/ mig                       | w/o mig | (ms) |
| Fib | 12.77            | 12.69   | 83   | 47.31               | 47.21                            | 96               | 16.45                              | 12.68   | 3770             | 13.37                        | 12.28   | 1090 |
| NQ  | 7.72             | 7.67    | 49   | 37.49               | 37.30                            | 193              | 7.93                               | 7.63    | 299              | 8.36                         | 7.15    | 1210 |
| TSP | 3.59             | 3.58    | 13   | 19.54               | 19.44                            | 96               | 3.67                               | 3.59    | 84               | 4.76                         | 3.54    | 1220 |
| FFT | 10.79            | 10.60   | 194  | 253.63              | 250.19                           | 3436             | 15.13                              | 10.75   | 4379             | 12.94                        | 10.15   | 2790 |

SOD has the smallest migration overhead : ranges from 13ms to 194ms under Gigabit Ethernet

#### Frame (SOD): Thread : Process : VM = 1 : 3 : 10 : 150





# **Part II**

### **Big Data Computing on Future Maycore Chips**



*C*rocodiles: <u>C</u>loud <u>R</u>untime with <u>O</u>bject <u>C</u>oherence <u>On</u> <u>D</u>ynamic tILES for future 1000-core tiled processors" (1/2013-12/2015, HK GRF)

# The 1000-Core Era

- Experts predict that by the end of the decade we could have as many as 1000 cores on a single die (S. Borkar, "Thousand core chips: a technology perspective")
- International Technology Roadmap for Semiconductors (ITRS) 2011 forecast:
  - By 2015: <u>450 cores</u>
  - By 2020: <u>1500 cores</u>
- Why 1000-core chip ?
  - Densely packed servers cluster  $\rightarrow$  Cloud Data Center in a Chip
  - Space efficiency + Power Efficiency (Greener)



## **Tiled Manycore Architectures**

| Micro-<br>architecture        | # of cores          | On-Chip<br>Network (Link<br>Bandwidth) | H/W<br>Coherence       | L1\$/core    | L2\$/core            | L3\$             | # of DDR<br>Controller |
|-------------------------------|---------------------|----------------------------------------|------------------------|--------------|----------------------|------------------|------------------------|
| Teraflops<br>Research Chip    | 80 (4.0<br>GHz)     | 2D Mesh<br>(256Gb/s)                   | No                     | 5KB          | 256KB                | NA               | 3D stacked<br>memory   |
| MIT's ATAC<br>(2008)          | 1000<br>(simulated) | 2D (optical)<br>Mesh                   | Yes                    | NA           | NA                   | NA               | NA                     |
| MIT EM <sup>2</sup><br>(2013) | 110                 | 2D Mesh                                | Simplified<br>(1 copy) | 8KB+32<br>KB | NA                   | NA               | 2                      |
| Single-Chip<br>Cloud (2009)   | 48 (1.0 GHz)        | 2D Mesh<br>(512Gb/s)                   | No                     | 32KB         | (256 + 8)<br>KB MPB  | Nil              | 4                      |
| Tilera<br>Tile-GX (2009)      | 100 (1.5<br>GHz)    | 2D Mesh<br>(320Gb/s)                   | Yes                    | 64KB         | 256KB                | 26MB<br>(shared) | 4                      |
| Godson-T<br>(FPGA, 2011)      | 64 (1.0 GHz)        | 2D Mesh                                | Yes                    | 32KB         | 128KB x<br>16 shared | Nil              | 4                      |

All adopted tile-based architecture: Cores are connected through a 2D network-on-a-chip

### **Tiled Manycore Architectures**



Tilera Tile-Gx100 (100 64-bit cores)



Intel's 48-core Single-chip Cloud Computer (SCC)



#### Adapteva's Parallella: 64 cores for \$199



Intel Knights Landing processor (2014/15)

## The Software Crisis in 1000-core Era

## □ New Challenges

- 1. "Off-chip Memory Wall" Problem
- 2. "Coherency Wall" Problem
- 3. "Power Wall" Problem
- Moving towards a parallelism with 1,000 cores requires a fairly radical rethinking of how to design system software.
- What we have done:
  - Developed a scalable OS-assisted shared virtual memory (SVM) system on a multikernel OS (Barrelfish) on the Intel Single-chip Cloud Computer (SCC) which represents a likely future norm of many-core noncache-coherent NUMA (NCC-NUMA) processor.
  - A "zone-based" dynamic voltage and frequency scaling (DVFS) method for power saving

### (1) "Off-chip Memory Wall" Problem

<sup>o</sup> DRAM performance (latency) improved slowly over the past 40 years.



(a) Gap of DRAM Density & Speed

(b) DRAM Latency Not Improved

Memory density has doubled nearly every two years, while performance has improved slowly  $\rightarrow$  Eliminating most of the benefits of processor improvement

Source: Shekhar Borkar, Andrew A. Chien, "The Future of Microprocessors", Communications of ACM, Vol. 54 No. 5, Pages 67-77, May 2011.

## (1) "Off-chip Memory Wall" Problem

#### <u>Smaller per-core DRAM bandwidth</u>

- Intel SCC : only 4 DDR3 memory controllers → not scale with the increasing core density
- o 3D stacked memory (TSV technology) helps ?



## New Design Strategies (Memory-Wall)

Data locality (working set) getting more critical!

- Stop multitasking
  - Context switching breaks data locality
  - <u>Space Sharing</u> instead of <u>Time Sharing</u>
- "NoVM" : (or Space-sharing VM)
  - No support of VM because of weaker cores (1.0-1.5 GHz)
  - "Space Sharing" as we have many cores.

• Others

- Maximize the use of on-chip memory (e.g., MPB in SCC)
- Compiler or runtime techniques to <u>improve data reuse</u> (or increase arithmetic intensity) → temporal locality becomes more critical

## (2): "Coherency Wall" Problem

Overhead of enforcing cache coherency across 1,000 cores at hardware level will put a hard limit on <u>scalability:</u>

- 1. **<u>Die space overhead</u>**:
  - Cache directory; read/write log increase
  - **200%** overhead for storing the 1000 presence-bits →128byte directory vs. 64-byte cache line
- 2. <u>Performance overhead</u>:
  - **20% more traffic** per miss than a system with caches but not coherence (e.g., locate other copies at hierarchical directories; issue invalidations to ALL sharers)
- 3. Not always needed:
  - Only **10% of the application** memory references actually require cache coherence tracking (Nilsson, 2003)

Why On-Chip Cache Coherence is Here to Stay

http://research.cs.wisc.edu/multifacet/papers/cacm2012\_coherence\_nearfinal.pdf

## (2): "Coherency Wall" Problem

- 4. Verification complexity and extensibility:
  - Multiple copies AND multiple paths through network → require to avoid deadlock, livelock, starvation due to subtle races and many transient states.
- 5. <u>Energy overhead</u>:
  - **Unnecessary data movement** and **replication** consumes extra energy consumption on network and cache resources (Kurian ISCA13);
  - Snoop-related cache activities can contribute up to **40%** of the total cache power (Ekman 2002, Loghi 2005)

**Intel's SCC and Teraflops Research Chip decided to give up coherent caches.** (History repeats itself : NCC-NUMA in 1990s: Cray T3D/ T3E)

## New Design Strategies ("Coherency Wall")

- Software-managed cache coherence
  - Leverage programmable
    on-chip memory (e.g., MPB
    on Intel SCC)
- Scope consistency (ScC) : minimizing on-chip network and off-chip DRAM traffic
- Migrating-home ScC
  Protocol (MH-ScC) →
  improve data locality



Before HomeMigratingAfter Home migrationMigrationphaseAfter Home migrationSimulation results obtained in a 8-node cluster (SOR program)

## (3) "Power Wall" Problem

• Computation costs much less energy than moving data to and from the computation units

10% of the operands move over the network (10 hops at 0.06pJ/bit)  $\rightarrow$  35 watts of power  $\rightarrow$  over 50% of the processor's power budget.

Bill Dally, Chief Scientist of nVIDIA

1 pJ for an integer operation

6UUX

- **20** pJ for a floating-point operation
- 1000X<sub>o</sub> 26 pJ to move an operand over 1mm of wire to local memory

**1 nJ** to read an operand from on-chip memory located at the far end of a chip

**16 nJ** to read an operand from off-chip DRAM



On-die network energy consumption per bit

picojoule (pJ) =  $10^{-12}$  J nanojoule (nJ) =  $10^{-9}$  J

## New Design Strategies ("Power Wall")

## • Stop moving so much data around

- o Data Locality (Working Set) still critical!
- Distance-aware cache placement and home migration
- Migrating "code & state" instead of data
  - Relatively easier in shared memory manycore systems.
  - MIT EM<sup>2</sup> support hardware thread migration
- Adopt multikernel operating system (e.g., Barrelfish)
  - Message passing among kernels to avoid unnecessary NoC traffic



• Barrelfish : "Compact message cheaper than many cache lines-- even on a cache-coherent machine."

**Crocodiles**: <u>C</u>loud <u>R</u>untime with <u>O</u>bject <u>C</u>oherence <u>O</u>n <u>D</u>ynamic tILES for future 1000-core tiled processors" (HK GRF: 01/2013-12/2015)





#### **Current Platform: INTEL 48-core SCC processor**



Routers (**R**), memory controllers (**MC**), mesh interface units (**MIU**), cache controllers (**CC**), front side bus (**FSB**). 45 nm CMOS technology, 1.3 billion transistors.

## (1) Cloud-on-Chip (CoC) Paradigm

- Condensing a data center into a single many-core chip
  - <u>"Zoning"</u>(Spatial Partition)
  - Multiple isolated zones  $\rightarrow$  Better performance isolation
  - Mimic multitenant Cloud computing without timesharing VMs → avoid context switching



## (2) Dynamic Zoning

- <u>"Dynamic Zone Scaling</u>":
  - Partitioning varies over time.
  - On-demand scaling of resources (e.g., # of cores, DRAM,..) for each zone.

Space

Space

 Fit well with the domain-based power management (e.g., Intel SCC)





#### (3) Software-Managed Cache Coherence: JumpSCC

- Leverage programmable on-chip memory (e.g., MPB on Intel SCC)
- <u>Scope Consistency (ScC)</u>
  : minimizing on-chip network and off-chip
   DRAM traffic
  - Existing systems using ScC: Jiajia (1998),
     Nautilus (1998), HKU JUMP (1999), HKU
     LOTS (2004), Godson-T (2009).



# **Intel SCC**

#### **JumpSCC: Hybrid Modes of Memory Sharing**

- Data can be shared in a different way.
  - Selectable on per-page basis
- <u>Two modes available:</u>
  - 1. Shared Physical Memory (SPM)
    - Intel SMC's way
    - All data kept as golden pages in shared DRAM
    - Set MPBT to **bypass L2 cache**.
    - Use write-through.
    - Use **CL1INVMB** and **flush WCB** to ensure consistency
  - 2. Distributed Shared Memory (DSM)
    - For each user core, it will copy the golden page to a cached copy in private DRAM upon page faults (due to memory protected).
    - Use *twin-and-diff* technique to avoid false sharing between multiple writers.



### A New Memory Type

- Message Passing Buffer Type (MPBT)
- **MPBT (or PMB)** is a bit in page table entry
  - $_{\circ}~$  We can map a chunk in off-chip DRAM as MPBT
  - We can map a chunk in on-chip MPB as non-MPBT
  - We can modify it at runtime
- MPBT tag only takes effect upon
  - L1\$ write miss (where to write: WCB or L2\$)
  - L1\$ read miss (where to read: MEM or L2\$)
  - **CL1INVMB** (invalidate MPBT-tagged lines in L1\$)

Tile

То

Router

Traffic

MIU

Message

Passing

Gen

P54C

(16KB

each L1)

P54C FSB

P540

(16KB

each L1

256KB

L2

CC

CC

256KB

12



#### **SCC Cache Behavior - Normal**





#### **SCC Cache Behavior - UnCached**





#### **SCC Cache Behavior - MPBT**



#### Alleviate the Memory Wall Problem

- Minimize DRAM access by <u>exploiting the MPB space</u> to store data (<u>programmer-hinted</u> or <u>profiler-guided</u>)
- Now we have two datapaths:
  - ${}_{1} \qquad DRAM \rightarrow L\mathbf{2} \rightarrow L\mathbf{1}$
  - $\mathbf{2.} \qquad \mathbf{DRAM} \to \mathbf{MPB} \to \mathbf{L1}$
- Example uses of MPB:
  - Used to reduce *cache pollution*



- For sequential data access (data without reuse), manually allocate buffer in MPB and copy the data from off-die DRAM to MPB; then L2 cache won't evict any cache lines and keep the hottest data set.
- Used to cache data of "warm temperature"
  - Warm data (long reuse distance) is secondary to hot data (short one);
  - **Data of reuse distance** > **L2 capacity** can still be read within on-chip speed if read from MPB rather than from DRAM.

#### **JumpSCC: System Design**

- Built on top of Barrelfish OS as a user library
- Resemble traditional shared-memory programming
- Just a different set of malloc, lock, unlock (and barrier)





### Page Remapping for Data Locality

#### • Programmer hint API:

- JIA\_ATTR\_FREQ\_ACCESS: frequently accessed (read-write)
- JIA\_ATTR\_READONLY: frequently accessed (read-only)
- 。 JIA\_ATTR\_SINGLE\_WRITER: single writer
- JIA\_ATTR\_NOCACHE: non-cacheable (avoid cache pollution)
- System handling:
  - **JIA\_ATTR\_FREQ\_ACCESS**: copy golden page to **private DRAM**
  - JIA\_ATTR\_READONLY: set to non-MPBT (make use of large L2 \$)
  - JIA\_ATTR\_SINGLE\_WRITER:
    - The writer sets to non-MPBT R/W; readers set to MPBT R/O.
    - At sync pts, the writer flushes L2 cache by reading 256 KB data.
  - JIA\_ATTR\_NOCACHE: set PTE to non-cacheable

*Exploit L2 cache* (our protocol ensures no consistency issues)

# **Performance Benchmarking**

#### MalStone Benchmark

• Analysis of "drive-by exploits" in web site log files

#### Graph 500 Benchmark

• Generation, compression and breadth-first search of large graphs

- Sorting (bucket-sort kernel)
- Miscellaneous compute kernels (skipped)





# Graph 500

- On 48 cores, can reach **2.1x** gain in hybrid mode over SPM;
  - the BFS loop with certain data reuse (blacklist checks) contributes that.



# **Graph 500: Scalability Analysis**

- Step 3 (BFS) @ hybrid mode achieved 12x speedup.
- Step 3 @ SPM only **2.43**x.



#### **Performance Counter Analysis**



51

# **Bucket Sort**

- **12x** better in hybrid mode than SPM alone
- Superlinear speedup observed in hybrid mode
  - Augmented cache effect since L2\$ not bypassed



52

#### **Performance Counter Analysis**



# **Zone-based DVFS Method for Power Saving**



"Latency-aware Dynamic Voltage and Frequency Scaling on Many-core Architecture for Data-intensive Applications", CLOUDCOM-ASIA 2014

# **Significance of JumpSCC**

- The first SVM for the Barrelfish OS
- Novel software CC system design:
  - Exploit both private memory and shared memory efficiently (selectively)
  - Support two "coherence modes" (or memory models) concurrently on a per-page basis
  - o Harness non-coherent L2 caches while others can't
- Performance is 12% to 12 times better than Intel SMC.
- Three patents claimed:
  - A hybrid shared virtual memory system with adaptive page remapping for noncache-coherent many-core processors
  - 2. A proactive scope consistency protocol tailored to many-core tiled processors with programmable on-chip buffers
  - 3. A location-aware k-way hash-based distributed on-chip page directory for tiled CMPs

# Conclusion

#### • GHz game is over→ Go for Manycore

- Processors parallelism is primary method of performance improvement
- Coherency-, memory- and power-wall challenges in the 1000-core era are discussed.
- Software-managed Cache Coherence Support:
  - Transfer the burden of cache coherence from hardware to software, while preserving hardware support for remote cache accesses.
  - On-chip programmable memory like MPB enable customizable or programmable on-chip activities.
- Power efficiency is the key challenge (flops/watt) → "DON'T MOVE THE DATA!"

# **Thanks!**

### For more information:



C.L. Wang's webpage: http://www.cs.hku.hk/~clwang/