Argon2 has the following input parameters:

Message string P, which is a password for password hashing applications. It MUST have a length not greater than 2^(32)1 bytes.

Nonce S, which is a salt for password hashing applications. It MUST have a length not greater than 2^(32)1 bytes. 16 bytes is RECOMMENDED for password hashing. The salt SHOULD be unique for each password.

Degree of parallelism p determines how many independent (but synchronizing) computational chains (lanes) can be run. It MUST be an integer value from 1 to 2^(24)1.

Tag length T MUST be an integer number of bytes from 4 to 2^(32)1.

Memory size m MUST be an integer number of kibibytes from 8*p to 2^(32)1. The actual number of blocks is m', which is m rounded down to the nearest multiple of 4*p.

Number of passes t (used to tune the running time independently of the memory size) MUST be an integer number from 1 to 2^(32)1.

Version number v MUST be one byte 0x13.

Secret value K is OPTIONAL. If used, it MUST have a length not greater than 2^(32)1 bytes.

Associated data X is OPTIONAL. If used, it MUST have a length not greater than 2^(32)1 bytes.

Type y MUST be 0 for Argon2d, 1 for Argon2i, or 2 for Argon2id.
The Argon2 output, or "tag", is a string T bytes long.
Argon2 uses an internal compression function G with two 1024byte inputs, a 1024byte output, and an internal hash function H^x(), with x being its output length in bytes. Here, H^x() applied to string A is the BLAKE2b ([
BLAKE2],
Section 3.3) function, which takes (d,ll,kk=0,nn=x) as parameters, where d is A padded to a multiple of 128 bytes and ll is the length of d in bytes. The compression function G is based on its internal permutation. A variablelength hash function H' built upon H is also used. G is described in
Section 3.5, and H' is described in
Section 3.3.
The Argon2 operation is as follows.

Establish H_0 as the 64byte value as shown below. If K, X, or S has zero length, it is just absent, but its length field remains.
H_0 = H^(64)(LE32(p)  LE32(T)  LE32(m)  LE32(t) 
LE32(v)  LE32(y)  LE32(length(P))  P 
LE32(length(S))  S  LE32(length(K))  K 
LE32(length(X))  X)

Allocate the memory as m' 1024byte blocks, where m' is derived as:
m' = 4 * p * floor (m / 4p)
For p lanes, the memory is organized in a matrix B[i][j] of blocks with p rows (lanes) and q = m' / p columns.

Compute B[i][0] for all i ranging from (and including) 0 to (not including) p.
B[i][0] = H'^(1024)(H_0  LE32(0)  LE32(i))

Compute B[i][1] for all i ranging from (and including) 0 to (not including) p.
B[i][1] = H'^(1024)(H_0  LE32(1)  LE32(i))

Compute B[i][j] for all i ranging from (and including) 0 to (not including) p and for all j ranging from (and including) 2 to (not including) q. The computation MUST proceed slicewise (Section 3.4): first, blocks from slice 0 are computed for all lanes (in an arbitrary order of lanes), then blocks from slice 1 are computed, etc. The block indices l and z are determined for each i, j differently for Argon2d, Argon2i, and Argon2id.
B[i][j] = G(B[i][j1], B[l][z])

If the number of passes t is larger than 1, we repeat step 5. We compute B[i][0] and B[i][j] for all i raging from (and including) 0 to (not including) p and for all j ranging from (and including) 1 to (not including) q. However, blocks are computed differently as the old value is XORed with the new one:
B[i][0] = G(B[i][q1], B[l][z]) XOR B[i][0];
B[i][j] = G(B[i][j1], B[l][z]) XOR B[i][j].

After t steps have been iterated, the final block C is computed as the XOR of the last column:
C = B[0][q1] XOR B[1][q1] XOR ... XOR B[p1][q1]

The output tag is computed as H'^T(C).
Let V_i be a 64byte block and W_i be its first 32 bytes. Then we define function H' as follows:
if T <= 64
H'^T(A) = H^T(LE32(T)A)
else
r = ceil(T/32)2
V_1 = H^(64)(LE32(T)A)
V_2 = H^(64)(V_1)
...
V_r = H^(64)(V_{r1})
V_{r+1} = H^(T32*r)(V_{r})
H'^T(X) = W_1  W_2  ...  W_r  V_{r+1}
To enable parallel block computation, we further partition the memory matrix into SL = 4 vertical slices. The intersection of a slice and a lane is called a segment, which has a length of q/SL. Segments of the same slice can be computed in parallel and do not reference blocks from each other. All other blocks can be referenced.
slice 0 slice 1 slice 2 slice 3
___/\___ ___/\___ ___/\___ ___/\___
/ \ / \ / \ / \
+++++
     > lane 0
+++++
     > lane 1
+++++
     > lane 2
+++++
 ... ... ...  ...
+++++
     > lane p  1
+++++
J_1 is given by the first 32 bits of block B[i][j1], while J_2 is given by the next 32 bits of block B[i][j1]:
J_1 = int32(extract(B[i][j1], 0))
J_2 = int32(extract(B[i][j1], 1))
For each segment, we do the following. First, we compute the value Z as:
Z= ( LE64(r)  LE64(l)  LE64(sl)  LE64(m') 
LE64(t)  LE64(y) )
where

r:

the pass number

l:

the lane number

sl:

the slice number

m':

the total number of memory blocks

t:

the total number of passes

y:

the Argon2 type (0 for Argon2d, 1 for Argon2i, 2 for Argon2id)
Then we compute:
q/(128*SL) 1024byte values
G(ZERO(1024),G(ZERO(1024),
Z  LE64(1)  ZERO(968) )),
G(ZERO(1024),G(ZERO(1024),
Z  LE64(2)  ZERO(968) )),... ,
G(ZERO(1024),G(ZERO(1024),
Z  LE64(q/(128*SL))  ZERO(968) )),
which are partitioned into q/(SL) 8byte values X, which are viewed as X1X2 and converted to J_1=int32(X1) and J_2=int32(X2).
The values r, l, sl, m', t, y, and i are represented as 8 bytes in little endian.
If the pass number is 0 and the slice number is 0 or 1, then compute J_1 and J_2 as for Argon2i, else compute J_1 and J_2 as for Argon2d.
The value of l = J_2 mod p gives the index of the lane from which the block will be taken. For the first pass (r=0) and the first slice (sl=0), the block is taken from the current lane.
The set W contains the indices that are referenced according to the following rules:

If l is the current lane, then W includes the indices of all blocks in the last SL  1 = 3 segments computed and finished, as well as the blocks computed in the current segment in the current pass excluding B[i][j1].

If l is not the current lane, then W includes the indices of all blocks in the last SL  1 = 3 segments computed and finished in lane l. If B[i][j] is the first block of a segment, then the very last index from W is excluded.
Then take a block from W with a nonuniform distribution over [0, W) using the following mapping:
J_1 > W(1  J_1^2 / 2^(64))
To avoid floating point computation, the following approximation is used:
x = J_1^2 / 2^(32)
y = (W * x) / 2^(32)
zz = W  1  y
Then take the zzth index from W; it will be the z value for the reference block index [l][z].
The compression function G is built upon the BLAKE2bbased transformation P. P operates on the 128byte input, which can be viewed as eight 16byte registers:
P(A_0, A_1, ... ,A_7) = (B_0, B_1, ... ,B_7)
The compression function G(X, Y) operates on two 1024byte blocks X and Y. It first computes R = X XOR Y. Then R is viewed as an 8x8 matrix of 16byte registers R_0, R_1, ... , R_63. Then P is first applied to each row, and then to each column to get Z:
( Q_0, Q_1, Q_2, ... , Q_7) < P( R_0, R_1, R_2, ... , R_7)
( Q_8, Q_9, Q_10, ... , Q_15) < P( R_8, R_9, R_10, ... , R_15)
...
(Q_56, Q_57, Q_58, ... , Q_63) < P(R_56, R_57, R_58, ... , R_63)
( Z_0, Z_8, Z_16, ... , Z_56) < P( Q_0, Q_8, Q_16, ... , Q_56)
( Z_1, Z_9, Z_17, ... , Z_57) < P( Q_1, Q_9, Q_17, ... , Q_57)
...
( Z_7, Z_15, Z 23, ... , Z_63) < P( Q_7, Q_15, Q_23, ... , Q_63)
Finally, G outputs Z XOR R:
G: (X, Y) > R > Q > Z > Z XOR R
++ ++
 X   Y 
++ ++
 
>XOR<

 \ /
 ++
  R 
 ++
 
 \ /
 P rowwise
 
 \ /
 ++
  Q 
 ++
 
 \ /
 P columnwise
 
 \ /
 ++
  Z 
 ++
 
 \ /
>XOR

\ /
Permutation P is based on the round function of BLAKE2b. The eight 16byte inputs S_0, S_1, ... , S_7 are viewed as a 4x4 matrix of 64bit words, where S_i = (v_{2*i+1}  v_{2*i}):
v_0 v_1 v_2 v_3
v_4 v_5 v_6 v_7
v_8 v_9 v_10 v_11
v_12 v_13 v_14 v_15
It works as follows:
GB(v_0, v_4, v_8, v_12)
GB(v_1, v_5, v_9, v_13)
GB(v_2, v_6, v_10, v_14)
GB(v_3, v_7, v_11, v_15)
GB(v_0, v_5, v_10, v_15)
GB(v_1, v_6, v_11, v_12)
GB(v_2, v_7, v_8, v_13)
GB(v_3, v_4, v_9, v_14)
GB(a, b, c, d) is defined as follows:
a = (a + b + 2 * trunc(a) * trunc(b)) mod 2^(64)
d = (d XOR a) >>> 32
c = (c + d + 2 * trunc(c) * trunc(d)) mod 2^(64)
b = (b XOR c) >>> 24
a = (a + b + 2 * trunc(a) * trunc(b)) mod 2^(64)
d = (d XOR a) >>> 16
c = (c + d + 2 * trunc(c) * trunc(d)) mod 2^(64)
b = (b XOR c) >>> 63
The modular additions in GB are combined with 64bit multiplications. Multiplications are the only difference from the original BLAKE2b design. This choice is done to increase the circuit depth and thus the running time of ASIC implementations, while having roughly the same running time on CPUs thanks to parallelism and pipelining.